Introduction

We have data on mobile game users: 30k users’ records and ~25 features, along with our response column “churn”. Our aim is to analyze the data and use machine learning algorithms to fit a model that can predict the churn behavior of users on a separate test set of 5000 users (distinct from the 30k we got), whose churn behavior we don’t know.

I will do exploratory data analysis, where I will look at the data structure, correlations, and visualizations, and try to extract insights from the data.

With the insights I get, I will do feature engineering: transforming existing features through data manipulation, or generating additional features from operations on the features I already have.

Lastly, I will fit models to the different data sets I created, and will use the best combination of model and data set in terms of accuracy (measured by splitting my training data into train and test sets) to make predictions on the actual test data (whose churn behavior we don’t know).

If I succeed, this model can be used to predict the churn behavior of newly signed-up users, so that business actions can be taken accordingly (such as sending special campaigns to those we predict will churn). Beyond this, our exploratory data analysis insights can be used to improve the game’s mechanics and features in ways that decrease the churn ratio of users.

1. Exploratory Data Analysis

First, we will import our data to R and review it.

x <- read.csv("ml_project_2022_churn_train.csv")
str(x)
## 'data.frame':    30000 obs. of  26 variables:
##  $ X                : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ X_id             : int  88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
##  $ os               : chr  "android" "android" "android" "android" ...
##  $ country          : chr  "IT" "BR" "RO" "LT" ...
##  $ device_brand     : chr  "samsung" "Xiaomi" "HUAWEI" "HUAWEI" ...
##  $ device_model     : chr  "SM-A125F" "Redmi 8" "CLT-L29" "SNE-LX1" ...
##  $ session_cnt      : int  7 1 1 2 1 3 1 3 1 8 ...
##  $ session_length   : int  13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
##  $ lang             : chr  "IT" "BR" "RO" "LT" ...
##  $ max_lvl_no       : int  13 9 28 14 12 20 9 23 9 1 ...
##  $ gameplay_duration: int  1088 1898 1685 1221 532 3018 393 747 382 1435 ...
##  $ bonus_cnt        : int  6 9 8 2 4 23 3 3 0 15 ...
##  $ hint1_cnt        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : int  2 1 0 0 1 0 0 0 0 0 ...
##  $ hint3_cnt        : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ repeat_cnt       : int  7 10 21 7 10 30 4 1 1 38 ...
##  $ gold_cnt         : int  418 25 990 1360 1367 585 620 2141 665 500 ...
##  $ claim_gold_cnt   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ banner_cnt       : int  17 38 0 49 23 154 16 33 9 14 ...
##  $ is_cnt           : int  1 0 16 2 0 7 0 11 0 2 ...
##  $ rv_cnt           : int  0 1 1 1 1 3 0 1 1 4 ...
##  $ churn            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ multi_lang       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ max_ses_length   : int  12451 1809 2141 1248 666 1817 503 449 520 605 ...
##  $ median_ses_length: int  100 1809 2141 649 666 1609 503 372 520 193 ...
##  $ avg_ses_length   : int  1880 1809 2141 649 666 1171 503 379 520 236 ...

We can see that (other than the response variable) our data consists of 6 categorical and 18 numerical variables (we don’t count the first column X, which is the row number and not really a feature), and we have 30000 observations.

We can remove X from our dataset since it is not a feature.

x <- x[,!names(x) %in% c("X")]


We will start our EDA with the descriptive (univariate) analysis of our features one by one.

1.1 Descriptive (Univariate) & Qualitative Correlation Analysis

Missing Variables

which(is.na(x))
## integer(0)
sum(is.na(x))
## [1] 0

There are no missing values in our data whatsoever. This is good, since we won’t have to worry about them.

Response Column

Our response column is “churn”. We calculate the ratio of churners vs non churners:

mean(x$churn)
## [1] 0.6269333

We find that ~63% of our observations are churners.

Categorical Features

To analyze the categorical features, we first convert them from chr or logi to Factor. We also factorize our response column “churn”: although it is stored as numeric, it is really a binary categorical variable with 2 levels (0 = not churned and 1 = churned).

x <- as.data.frame(unclass(x), stringsAsFactors = TRUE) #factorize chr's
x$multi_lang <- as.factor(x$multi_lang)
x$churn <- as.factor(x$churn) #factorize churn
str(x)
## 'data.frame':    30000 obs. of  25 variables:
##  $ X_id             : int  88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
##  $ os               : Factor w/ 1 level "android": 1 1 1 1 1 1 1 1 1 1 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : int  7 1 1 2 1 3 1 3 1 8 ...
##  $ session_length   : int  13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : int  13 9 28 14 12 20 9 23 9 1 ...
##  $ gameplay_duration: int  1088 1898 1685 1221 532 3018 393 747 382 1435 ...
##  $ bonus_cnt        : int  6 9 8 2 4 23 3 3 0 15 ...
##  $ hint1_cnt        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : int  2 1 0 0 1 0 0 0 0 0 ...
##  $ hint3_cnt        : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ repeat_cnt       : int  7 10 21 7 10 30 4 1 1 38 ...
##  $ gold_cnt         : int  418 25 990 1360 1367 585 620 2141 665 500 ...
##  $ claim_gold_cnt   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ banner_cnt       : int  17 38 0 49 23 154 16 33 9 14 ...
##  $ is_cnt           : int  1 0 16 2 0 7 0 11 0 2 ...
##  $ rv_cnt           : int  0 1 1 1 1 3 0 1 1 4 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : int  12451 1809 2141 1248 666 1817 503 449 520 605 ...
##  $ median_ses_length: int  100 1809 2141 649 666 1609 503 372 520 193 ...
##  $ avg_ses_length   : int  1880 1809 2141 649 666 1171 503 379 520 236 ...

We will draw bar graphs of our categorical variables with respect to churn to see the distributions using ggplot.

library(ggplot2)
categorical_names <- names(Filter(is.factor, x))

for (i in categorical_names) {
  print(ggplot(x, aes(x[,i])) + geom_bar(aes(fill = churn), position = "dodge") + labs(x = i))
}

At first glance, we can see that we have a non-informative variable (os has only one level) and variables with low-entropy or class-imbalance problems. Now let’s analyze them one by one.

OS

The os column has only one level, which makes it a non-informative feature with no value whatsoever for predicting churn. This means we can remove this variable from our dataset, reducing complexity and computation cost with no drawback.

x <- x[,!names(x) %in% c("os")]

Country

The country feature seems to suffer from the low-entropy problem: it has many levels that occur in only a few observations. From the graph, we can say that roughly 15 of its 71 levels cover most of the data.

We can deal with this problem in two ways: (1) combine the categories with few observations into a single new category called “Other”, or (2) find another highly correlated categorical feature, such as language, keep only that feature, and remove country altogether.

Other than this problem, the churn ratios seem to differ among countries, which means this variable can give us valuable information for predicting churn.
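Option (1) can be sketched with forcats (which we also use later for plotting): fct_lump_min pools every level below a chosen frequency threshold into “Other”. This is a minimal sketch on a toy stand-in for x$country, and the threshold of 20 is an arbitrary choice for illustration:

```r
library(forcats)

#toy stand-in for x$country: a few frequent levels plus several rare ones
country <- factor(c(rep("IT", 50), rep("BR", 40), rep("RO", 30),
                    "AE", "AL", "AM", "AZ", "BA"))

#pool every level with fewer than 20 observations into "Other"
country_lumped <- fct_lump_min(country, min = 20, other_level = "Other")
table(country_lumped)
```

On the real data, the threshold would be tuned so that the retained levels cover most of the observations, in line with the ~15-levels observation above.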

Language

Language looks like another version of country, with fewer levels and milder low-entropy/class-imbalance problems. Again, roughly 15 levels out of 28 (significantly fewer than country) cover most of the observations.

If country and language are as highly correlated as one would expect, we can keep one of them and remove the other.
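One quick way to check this expectation is to measure how often the two codes agree, and, more formally, to compute Cramér’s V from the chi-square statistic of their contingency table. A minimal sketch on toy data (the two vectors below are hypothetical stand-ins for x$country and x$lang):

```r
#hypothetical stand-ins for x$country and x$lang
country <- c("IT", "BR", "RO", "LT", "RO", "DE")
lang    <- c("IT", "BR", "RO", "LT", "RO", "EN")

#share of users whose language code matches their country code
agreement <- mean(country == lang)

#Cramer's V: a 0-1 association measure derived from the chi-square statistic
tab <- table(country, lang)
chi <- suppressWarnings(chisq.test(tab, correct = FALSE))
cramers_v <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
```

Since lang uses codes like “BU” that are not country codes, exact string matching is only a heuristic; Cramér’s V does not require the codes to match.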

Device Brand

We see the same problems as with country (low entropy and class imbalance), but even more pronounced in this feature.

We also do not see any significant difference in churn ratios between device brands at first glance, but we can check this by simply drawing churn-ratio graphs for the most frequent levels of this feature.

brand_names_150 <- names(which(table(x$device_brand) > 150))
brand_names_150
##  [1] ""           "HMD Global" "HUAWEI"     "LENOVO"     "LGE"       
##  [6] "motorola"   "OnePlus"    "OPPO"       "realme"     "samsung"   
## [11] "TCL"        "Xiaomi"
library(data.table)
xvdt <- data.table(x)

brand150 <- xvdt[device_brand %in% brand_names_150]
brand150
##            X_id country device_brand    device_model session_cnt session_length
##     1: 88464659      IT      samsung        SM-A125F           7          13162
##     2: 88466479      BR       Xiaomi         Redmi 8           1           1809
##     3: 88468912      RO       HUAWEI         CLT-L29           1           2141
##     4: 88471351      LT       HUAWEI         SNE-LX1           2           1298
##     5: 88473161      RO       HUAWEI         LYA-L29           1            666
##    ---                                                                         
## 29074: 95099191      IT       Xiaomi   Redmi Note 8T           4          10668
## 29075: 95105634      BG       realme         RMX3263           2           2104
## 29076: 95111824      BR       Xiaomi Mi Note 10 Lite           1            397
## 29077: 95117734      GR      samsung        SM-A217F           3           1548
## 29078: 95122431      GR      samsung        SM-G975F           3            390
##        lang max_lvl_no gameplay_duration bonus_cnt hint1_cnt hint2_cnt
##     1:   IT         13              1088         6         0         2
##     2:   BR          9              1898         9         0         1
##     3:   RO         28              1685         8         0         0
##     4:   LT         14              1221         2         0         0
##     5:   RO         12               532         4         0         1
##    ---                                                                
## 29074:   IT         26              8390        52         0         4
## 29075:   BU         17              2046         2         0         5
## 29076:   BR          4               408         3         0         4
## 29077:   GR         20              1248         3         0         1
## 29078:   GR          8               310         0         0         0
##        hint3_cnt repeat_cnt gold_cnt claim_gold_cnt banner_cnt is_cnt rv_cnt
##     1:         0          7      418              0         17      1      0
##     2:         0         10       25              0         38      0      1
##     3:         0         21      990              0          0     16      1
##     4:         0          7     1360              0         49      2      1
##     5:         0         10     1367              0         23      0      1
##    ---                                                                      
## 29074:         6         48     1585              0        346     38     20
## 29075:         0         12       72              0         64      5      2
## 29076:         0          3      545              0          0      0      1
## 29077:         0          3      115              0         48      5      2
## 29078:         0          1      605              0          8      0      0
##        churn multi_lang max_ses_length median_ses_length avg_ses_length
##     1:     1      FALSE          12451               100           1880
##     2:     1      FALSE           1809              1809           1809
##     3:     1      FALSE           2141              2141           2141
##     4:     1      FALSE           1248               649            649
##     5:     1      FALSE            666               666            666
##    ---                                                                 
## 29074:     1      FALSE           3846              2992           2667
## 29075:     0      FALSE           1784              1052           1052
## 29076:     1      FALSE            397               397            397
## 29077:     0      FALSE            621               582            516
## 29078:     0      FALSE            388                 2            130
library(ggplot2)
library(forcats) #we will use fct_infreq to plot the brands in descending order by their counts

ggplot(brand150, aes(fct_infreq(device_brand))) + geom_bar(aes(fill = churn), position = "dodge")

ggplot(brand150, aes(fct_infreq(device_brand))) + geom_bar(aes(fill = churn), position = "fill")

We can see that by keeping only the brands that appear in more than 150 observations, we reduce the level count from 113 to 12 while overlooking only 922 observations.

We see differentiation between the churn ratios of different brands, which means this feature may give us valuable information on churn.

We also see a level named “” (the empty string). These are likely users whose device brand we could not obtain.
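If we keep this feature, it may be safer to give the empty level an explicit label so it is not silently treated like a regular brand. A minimal sketch on a toy stand-in for x$device_brand (the label “unknown” is an arbitrary choice):

```r
#toy stand-in for x$device_brand with an empty-string level
device_brand <- factor(c("samsung", "", "HUAWEI", ""))

#relabel the empty level explicitly
levels(device_brand)[levels(device_brand) == ""] <- "unknown"
table(device_brand)
```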

Device Model

Device model suffers from high cardinality, i.e. it has far too many levels, so we apply the same operation to it.

model_names_100 <- names(which(table(x$device_model) > 100))
model_names_100
##  [1] ""                 "ANE-LX1"          "CLT-L29"          "DUB-LX1"         
##  [5] "ELE-L29"          "EML-L29"          "FIG-LX1"          "M2003J15SC"      
##  [9] "M2004J19C"        "M2006C3MNG"       "M2007J20CG"       "M2010J19SY"      
## [13] "M2101K6G"         "M2101K7AG"        "M2101K7BNY"       "M2102J20SG"      
## [17] "M2103K19G"        "MAR-LX1A"         "MRD-LX1"          "POT-LX1"         
## [21] "Redmi 8"          "Redmi Note 7"     "Redmi Note 8"     "Redmi Note 8 Pro"
## [25] "Redmi Note 8T"    "Redmi Note 9 Pro" "Redmi Note 9S"    "SM-A105FN"       
## [29] "SM-A125F"         "SM-A127F"         "SM-A202F"         "SM-A217F"        
## [33] "SM-A307FN"        "SM-A315G"         "SM-A325F"         "SM-A326B"        
## [37] "SM-A405FN"        "SM-A415F"         "SM-A505FN"        "SM-A515F"        
## [41] "SM-A520F"         "SM-A525F"         "SM-A528B"         "SM-A600FN"       
## [45] "SM-A705FN"        "SM-A715F"         "SM-A725F"         "SM-A750FN"       
## [49] "SM-G780F"         "SM-G780G"         "SM-G781B"         "SM-G950F"        
## [53] "SM-G955F"         "SM-G960F"         "SM-G965F"         "SM-G970F"        
## [57] "SM-G973F"         "SM-G975F"         "SM-G980F"         "SM-G991B"        
## [61] "SM-G996B"         "SM-J415FN"        "SM-J600FN"        "SNE-LX1"         
## [65] "STK-LX1"          "VOG-L29"          "YAL-L21"
library(data.table)
xvdt <- data.table(x)

model100 <- xvdt[device_model %in% model_names_100]
model100
##            X_id country device_brand  device_model session_cnt session_length
##     1: 88464659      IT      samsung      SM-A125F           7          13162
##     2: 88466479      BR       Xiaomi       Redmi 8           1           1809
##     3: 88468912      RO       HUAWEI       CLT-L29           1           2141
##     4: 88471351      LT       HUAWEI       SNE-LX1           2           1298
##     5: 88476408      FR       Xiaomi  Redmi Note 8           1            503
##    ---                                                                       
## 17122: 95086724      PL       HUAWEI       POT-LX1          34          45765
## 17123: 95092582      BG      samsung      SM-A125F          10           8133
## 17124: 95099191      IT       Xiaomi Redmi Note 8T           4          10668
## 17125: 95117734      GR      samsung      SM-A217F           3           1548
## 17126: 95122431      GR      samsung      SM-G975F           3            390
##        lang max_lvl_no gameplay_duration bonus_cnt hint1_cnt hint2_cnt
##     1:   IT         13              1088         6         0         2
##     2:   BR          9              1898         9         0         1
##     3:   RO         28              1685         8         0         0
##     4:   LT         14              1221         2         0         0
##     5:   FR          9               393         3         0         0
##    ---                                                                
## 17122:   PL          2             25228       133         0        25
## 17123:   BU         23             10833         1         0         1
## 17124:   IT         26              8390        52         0         4
## 17125:   GR         20              1248         3         0         1
## 17126:   GR          8               310         0         0         0
##        hint3_cnt repeat_cnt gold_cnt claim_gold_cnt banner_cnt is_cnt rv_cnt
##     1:         0          7      418              0         17      1      0
##     2:         0         10       25              0         38      0      1
##     3:         0         21      990              0          0     16      1
##     4:         0          7     1360              0         49      2      1
##     5:         0          4      620              0         16      0      0
##    ---                                                                      
## 17122:        33        242      515              0        302     75     45
## 17123:         0         39     1870              0        136      8      2
## 17124:         6         48     1585              0        346     38     20
## 17125:         0          3      115              0         48      5      2
## 17126:         0          1      605              0          8      0      0
##        churn multi_lang max_ses_length median_ses_length avg_ses_length
##     1:     1      FALSE          12451               100           1880
##     2:     1      FALSE           1809              1809           1809
##     3:     1      FALSE           2141              2141           2141
##     4:     1      FALSE           1248               649            649
##     5:     1      FALSE            503               503            503
##    ---                                                                 
## 17122:     1      FALSE          18593               400           1346
## 17123:     0      FALSE           3407               231            813
## 17124:     1      FALSE           3846              2992           2667
## 17125:     0      FALSE            621               582            516
## 17126:     0      FALSE            388                 2            130
library(ggplot2)
library(forcats) #we will use fct_infreq to plot the models in descending order by their counts

ggplot(model100, aes(fct_infreq(device_model))) + geom_bar(aes(fill = churn), position = "dodge")

ggplot(model100, aes(fct_infreq(device_model))) + geom_bar(aes(fill = churn), position = "fill")

Even when we drop the models that appear in fewer than 100 observations, the level count is still almost 70, which is too high, and we overlook nearly 13000 observations, which is also too many. This means we can’t handle the high-cardinality problem of device_model simply by assigning some of its levels to an “other” category.

We also don’t see a clear relationship between the brand of a model and its churn ratio in the graphs. For example, some Samsung models have close-to-average churn ratios, while others have much smaller ones. The device_brand feature will still capture some of the information device_model gives us, but we can’t hope to recover nearly all of it from device_brand, since the two won’t be as correlated as language and country.

With this information, we can either (1) remove the device_model column completely, or (2) do some feature engineering and keep only the models whose churn ratios are far above or below average, encoding each as a binary (yes/no) categorical feature. For example, a model x with a churn ratio of 0.40 (well below average) could be included as a binary categorical feature x_yes (yes = the device is model x, no = it is not).
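Option (2) could be sketched as below: compute per-model churn ratios, flag the models that deviate strongly from the overall ratio, and add one yes/no feature per flagged model. The toy data frame and the 0.2 deviation threshold are illustrative assumptions, not values from our data:

```r
#toy stand-in for the relevant columns of x
d <- data.frame(
  device_model = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  churn        = c( 0,   0,   0,   1,   1,   1,   1,   1,   0,   1 )
)

overall  <- mean(d$churn)
by_model <- tapply(d$churn, d$device_model, mean)

#flag models whose churn ratio is far from the overall ratio (threshold is illustrative)
extreme <- names(by_model)[abs(by_model - overall) > 0.2]

#one binary yes/no feature per flagged model
for (m in extreme) {
  d[[paste0(m, "_yes")]] <- factor(ifelse(d$device_model == m, "yes", "no"))
}
```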

Multi Language

We see that this feature heavily suffers from class imbalance: as expected, most users do not use multiple languages.

  1. Users who do not use multiple languages do not seem to give us much information, since their churn ratio is very close to the average, as expected.

  2. On the other hand, users who use multiple languages seem to have a much higher churn ratio than the average (63%).

Let’s analyze this.
table(x$multi_lang)
## 
## FALSE  TRUE 
## 29359   641
ggplot(x, aes(multi_lang)) + geom_bar(aes(fill = churn), position = "fill")

We can see from the plot that users with multi_lang = TRUE (641 observations) have a much higher than average churn ratio (~87%).
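The same ratios can be read off numerically with row-wise proportions. A minimal sketch on toy data (on the real data, the table would be built from x$multi_lang and x$churn):

```r
#toy stand-in: most users have multi_lang = FALSE, as in our data
multi_lang <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
churn      <- c(1, 0, 1, 0, 1, 1)

#churn ratio within each multi_lang group (rows sum to 1)
ratios <- prop.table(table(multi_lang, churn), margin = 1)
ratios
```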

Numerical Features

x_numeric <- x[, sapply(x, is.numeric)]

library(sjmisc)
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
descr(x_numeric, show = c("mean", "sd", "md", "trimmed", "range", "iqr", "skew"))
## 
## ## Basic descriptive statistics
## 
##                var        mean         sd       md     trimmed
##               X_id 91090972.38 2069047.39 90726232 90940005.12
##        session_cnt        8.58      43.19        3        3.55
##     session_length     9082.73   65571.48     1926     2764.35
##         max_lvl_no       21.71      55.82       13       16.12
##  gameplay_duration     7409.82  214476.01     1453     2091.74
##          bonus_cnt       40.14     425.68        6        9.63
##          hint1_cnt        0.46      20.46        0        0.00
##          hint2_cnt        2.22       8.84        0        0.95
##          hint3_cnt       11.43     193.51        0        0.48
##         repeat_cnt       71.87     543.95       12       19.02
##           gold_cnt      843.29    2447.10      662      724.81
##     claim_gold_cnt        0.00       0.00        0        0.00
##         banner_cnt      165.05     742.50       46       68.74
##             is_cnt       22.82     104.07        5        9.02
##             rv_cnt       10.39     131.55        2        2.44
##     max_ses_length     3262.95   37949.28     1191     1481.93
##  median_ses_length      865.83    1279.61      535      648.11
##     avg_ses_length     1097.15    4729.89      656      784.54
##                        range        iqr   skew
##  6661078 (88463874-95124952) 3713191.00   0.46
##                2294 (1-2295)       4.00  22.70
##         5099724 (12-5099736)    3616.25  34.57
##                3615 (1-3616)      18.00  34.26
##        36562745 (4-36562749)    2802.25 165.19
##              35727 (0-35727)      15.00  43.89
##                2857 (0-2857)       0.00 105.42
##                  563 (0-563)       2.00  26.27
##              22382 (0-22382)       1.00  71.63
##              30460 (0-30460)      29.00  30.05
##            217085 (0-217085)     570.00  67.86
##                      0 (0-0)       0.00    NaN
##              30602 (0-30602)     103.00  18.83
##                8774 (0-8774)      18.00  31.51
##              13504 (0-13504)       4.00  62.40
##         5095045 (12-5095057)    1686.25  93.45
##              86854 (0-86854)     730.00  16.54
##           690590 (12-690602)     815.00 110.69
numeric_names <- names(Filter(is.numeric, x))

for (i in numeric_names) {
  print(ggplot(x, aes(x[,i])) + geom_histogram(bins = 1000) + labs(x = i))
}

Because of the extreme outliers in most of the numeric features, it is impossible to gain any insight from these distributions. This is why we will first remove the outliers.

Other than this, we see that claim_gold_cnt takes only the single value 0. This means we can remove this column.
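Zero-variance columns like this can also be found programmatically rather than by eye; a minimal sketch on a toy data frame:

```r
#toy stand-in with one constant column
x_toy <- data.frame(claim_gold_cnt = c(0, 0, 0), gold_cnt = c(418, 25, 990))

#columns with a single unique value carry no information
constant_cols <- names(Filter(function(col) length(unique(col)) == 1, x_toy))
constant_cols
```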

We experiment with outliers a bit:

x <- x[,!names(x) %in% c("claim_gold_cnt")]
for (i in numeric_names) {
  print(i)
  x_no_outliers <- x[,!names(x) %in% c(i)] #drop column i; filter outliers on all remaining columns
  numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))

  for (j in numeric_no_outliers) { #separate loop variable, so the outer i is not shadowed
    quartiles <- quantile(x[,j], probs=c(.25, .75), na.rm = FALSE)
    IQR <- IQR(x[,j])

    Lower <- quartiles[1] - 1.5*IQR
    Upper <- quartiles[2] + 1.5*IQR

    x_no_outliers <- subset(x_no_outliers, x_no_outliers[,j] > Lower & x_no_outliers[,j] < Upper)
  }

  print(dim(x_no_outliers))
}
## [1] "X_id"
## [1]  0 22
## [1] "session_cnt"
## [1]  0 22
## [1] "session_length"
## [1]  0 22
## [1] "max_lvl_no"
## [1]  0 22
## [1] "gameplay_duration"
## [1]  0 22
## [1] "bonus_cnt"
## [1]  0 22
## [1] "hint1_cnt"
## [1] 17775    22
## [1] "hint2_cnt"
## [1]  0 22
## [1] "hint3_cnt"
## [1]  0 22
## [1] "repeat_cnt"
## [1]  0 22
## [1] "gold_cnt"
## [1]  0 22
## [1] "claim_gold_cnt"
## [1]  0 23
## [1] "banner_cnt"
## [1]  0 22
## [1] "is_cnt"
## [1]  0 22
## [1] "rv_cnt"
## [1]  0 22
## [1] "max_ses_length"
## [1]  0 22
## [1] "median_ses_length"
## [1]  0 22
## [1] "avg_ses_length"
## [1]  0 22

We know from the earlier descriptive statistics table that the IQR of hint1_cnt is 0. The code above removes the outliers of every column except one, repeating this for each choice of excluded column.

As expected, whenever hint1_cnt is among the filtered columns, every row is removed: with IQR = 0, both bounds equal 0, and the strict condition value > 0 & value < 0 is never true. Only when we take out the outliers of every column except hint1_cnt do we keep any rows, ~18k of them.

x <- x[,!names(x) %in% c("claim_gold_cnt")]
x_no_outliers <- x
numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))

for (i in numeric_no_outliers) {
  quartiles <- quantile(x[,i], probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(x[,i])

  Lower <- quartiles[1] - 1.5*IQR
  Upper <- quartiles[2] + 1.5*IQR

  #subset from x (not x_no_outliers): each iteration filters on one column only
  x_no_outliers <- subset(x, x[,i] > Lower & x[,i] < Upper)

  print(i)
  print(dim(x_no_outliers))
}
## [1] "X_id"
## [1] 30000    23
## [1] "session_cnt"
## [1] 27032    23
## [1] "session_length"
## [1] 26713    23
## [1] "max_lvl_no"
## [1] 27906    23
## [1] "gameplay_duration"
## [1] 26824    23
## [1] "bonus_cnt"
## [1] 26738    23
## [1] "hint1_cnt"
## [1]  0 23
## [1] "hint2_cnt"
## [1] 26036    23
## [1] "hint3_cnt"
## [1] 25311    23
## [1] "repeat_cnt"
## [1] 26591    23
## [1] "gold_cnt"
## [1] 28510    23
## [1] "banner_cnt"
## [1] 27011    23
## [1] "is_cnt"
## [1] 27077    23
## [1] "rv_cnt"
## [1] 26569    23
## [1] "max_ses_length"
## [1] 27448    23
## [1] "median_ses_length"
## [1] 27579    23
## [1] "avg_ses_length"
## [1] 27619    23

Now we remove the outliers of each column one at a time (resetting to the full data between columns) and see no problems other than hint1_cnt.

x <- x[,!names(x) %in% c("claim_gold_cnt")]
x_hint1removed <- x[,!names(x) %in% c("hint1_cnt")]
x_no_outliers <- x_hint1removed
numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))

for (i in numeric_no_outliers) {
  quartiles <- quantile(x[,i], probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(x[,i])

  Lower <- quartiles[1] - 10*IQR
  Upper <- quartiles[2] + 10*IQR

  #cumulative: keep filtering the already-filtered data
  x_no_outliers <- subset(x_no_outliers, x_no_outliers[,i] > Lower & x_no_outliers[,i] < Upper)

  print(i)
  print(dim(x_no_outliers))
  print(ggplot(x_no_outliers, aes(x_no_outliers[,i])) + geom_histogram(bins = 200, fill = "purple") + labs(x = i))
}
## [1] "X_id"
## [1] 30000    22

## [1] "session_cnt"
## [1] 29292    22

## [1] "session_length"
## [1] 29019    22

## [1] "max_lvl_no"
## [1] 28980    22

## [1] "gameplay_duration"
## [1] 28918    22

## [1] "bonus_cnt"
## [1] 28792    22

## [1] "hint2_cnt"
## [1] 28678    22

## [1] "hint3_cnt"
## [1] 27127    22

## [1] "repeat_cnt"
## [1] 27096    22

## [1] "gold_cnt"
## [1] 27092    22

## [1] "banner_cnt"
## [1] 27088    22

## [1] "is_cnt"
## [1] 27083    22

## [1] "rv_cnt"
## [1] 27003    22

## [1] "max_ses_length"
## [1] 26886    22

## [1] "median_ses_length"
## [1] 26832    22

## [1] "avg_ses_length"
## [1] 26832    22

This is what we get when we remove the outliers of all columns cumulatively (except hint1_cnt).

When we do all of them with 1.5 × IQR, we are left with ~18k rows. I used 10 × IQR instead, to remove only the most extreme outliers while not losing as much data (we lose ~3.2k rows, significantly fewer than 12k), and plotted again.

This can be useful for future models and feature engineering.

Now we draw the same plots without the outliers (one column at a time, for all columns except hint1_cnt).

x <- x[,!names(x) %in% c("claim_gold_cnt")]
x_no_outliers <- x[,!names(x) %in% c("hint1_cnt")]
numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))

for (i in numeric_no_outliers) {
  quartiles <- quantile(x[,i], probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(x[,i])

  Lower <- quartiles[1] - 1.5*IQR
  Upper <- quartiles[2] + 1.5*IQR

  #reset from x each time: outliers are removed for one column at a time
  x_no_outliers <- subset(x, x[,i] > Lower & x[,i] < Upper)

  print(i)
  print(dim(x_no_outliers))
  print(ggplot(x_no_outliers, aes(x_no_outliers[,i])) + geom_histogram(bins = 200, fill = "purple") + labs(x = i))
}
## [1] "X_id"
## [1] 30000    23

## [1] "session_cnt"
## [1] 27032    23

## [1] "session_length"
## [1] 26713    23

## [1] "max_lvl_no"
## [1] 27906    23

## [1] "gameplay_duration"
## [1] 26824    23

## [1] "bonus_cnt"
## [1] 26738    23

## [1] "hint2_cnt"
## [1] 26036    23

## [1] "hint3_cnt"
## [1] 25311    23

## [1] "repeat_cnt"
## [1] 26591    23

## [1] "gold_cnt"
## [1] 28510    23

## [1] "banner_cnt"
## [1] 27011    23

## [1] "is_cnt"
## [1] 27077    23

## [1] "rv_cnt"
## [1] 26569    23

## [1] "max_ses_length"
## [1] 27448    23

## [1] "median_ses_length"
## [1] 27579    23

## [1] "avg_ses_length"
## [1] 27619    23

  1. Almost all of our features’ distributions are right-skewed.
  2. hint3_cnt suffers heavily from class imbalance; hint2_cnt also suffers, but not as much.
  3. It is possible to turn session_cnt, hint2_cnt, hint3_cnt, and rv_cnt into ordinal categorical variables.
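The conversion in point 3 can be sketched with cut(), which bins a count into an ordered factor; the breakpoints and labels below are illustrative choices, not tuned values:

```r
#toy stand-in for a right-skewed count such as hint3_cnt
cnt <- c(0, 0, 0, 1, 2, 5, 40)

#bin into an ordered factor: 0, 1-2, and 3+
cnt_ord <- cut(cnt, breaks = c(-Inf, 0, 2, Inf),
               labels = c("none", "few", "many"), ordered_result = TRUE)
table(cnt_ord)
```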

Now we do a qualitative correlation analysis between churn and numerical features:

x <- x[,!names(x) %in% c("claim_gold_cnt")]
x_no_outliers <- x[,!names(x) %in% c("hint1_cnt")]
numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))

for (i in numeric_no_outliers) {
  quartiles <- quantile(x[,i], probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(x[,i])

  Lower <- quartiles[1] - 1.5*IQR
  Upper <- quartiles[2] + 1.5*IQR

  #reset from x each time: outliers are removed for one column at a time
  x_no_outliers <- subset(x, x[,i] > Lower & x[,i] < Upper)

  print(i)
  print(dim(x_no_outliers))
  print(ggplot(x_no_outliers, aes(x_no_outliers[,i], color = churn)) + geom_histogram(bins = 1000) + labs(x = i) + facet_wrap(~churn))
}
## [1] "X_id"
## [1] 30000    23

## [1] "session_cnt"
## [1] 27032    23

## [1] "session_length"
## [1] 26713    23

## [1] "max_lvl_no"
## [1] 27906    23

## [1] "gameplay_duration"
## [1] 26824    23

## [1] "bonus_cnt"
## [1] 26738    23

## [1] "hint2_cnt"
## [1] 26036    23

## [1] "hint3_cnt"
## [1] 25311    23

## [1] "repeat_cnt"
## [1] 26591    23

## [1] "gold_cnt"
## [1] 28510    23

## [1] "banner_cnt"
## [1] 27011    23

## [1] "is_cnt"
## [1] 27077    23

## [1] "rv_cnt"
## [1] 26569    23

## [1] "max_ses_length"
## [1] 27448    23

## [1] "median_ses_length"
## [1] 27579    23

## [1] "avg_ses_length"
## [1] 27619    23

  1. We see that lower values of X_id have a much higher churn ratio. Earlier versions of the game being worse may explain this.

  2. Users who have only one session have a lower churn ratio than average. This suggests we can turn this feature into a binary categorical variable (session_cnt equal to 1 vs. more than 1).

  3. People who did not gain many levels on their account have higher churn ratios (as one would expect, because they did not play the game as much), and as the level increases the churn ratio becomes much smaller. After the 10th level there is a strong downhill trend for churners, meaning their chances of churning get smaller and smaller. After the 25th level, the churn ratio drops below 50%.

  4. One would expect gameplay duration to give valuable information like level count, but churn does not appear to be as strongly correlated with gameplay duration as it is with max level count.

  5. People who use hint 3 have a much higher churn ratio than those who do not use it.

  6. There is a spike in churn ratio where people had ~500 gold and ~1000 gold. This could also be turned into a categorical variable (gold near 500 or 1000 = yes/no).

  7. max_ses_length shows a spike in churn ratio between ~250 and ~1500.

  8. As average session length increases, the churn ratio decreases, with its maximum values between ~250 and ~750.
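Observations 2 and 6 above suggest binary recodings. A minimal sketch on toy data (the ±50 gold tolerance and the column names `single_session` and `gold_spike` are illustrative assumptions; in the report this would be applied to the training frame x):

```r
#toy stand-in for the relevant columns of x
toy <- data.frame(session_cnt = c(1, 5, 1, 12),
                  gold_cnt    = c(510, 120, 1000, 740))

#observation 2: one session vs. more than one
toy$single_session <- factor(ifelse(toy$session_cnt == 1, "yes", "no"))

#observation 6: flag users whose gold balance sits near the ~500 or ~1000 spikes
near <- function(v, center, tol = 50) abs(v - center) <= tol
toy$gold_spike <- factor(ifelse(near(toy$gold_cnt, 500) | near(toy$gold_cnt, 1000),
                                "yes", "no"))
toy
```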

1.2 Quantitative Correlation (Bivariate) Analysis

Categorical Variables

I will use Chi-square test and information gain for categorical variables.

Chi-square Test

categorical_names <- names(Filter(is.factor, x))
for (i in categorical_names) {
  print(i)
  print(chisq.test(x$churn, x[, i], correct=FALSE))
}
## [1] "country"
## Warning in chisq.test(x$churn, x[, i], correct = FALSE): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  x$churn and x[, i]
## X-squared = 937.8, df = 70, p-value < 2.2e-16
## 
## [1] "device_brand"
## Warning in chisq.test(x$churn, x[, i], correct = FALSE): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  x$churn and x[, i]
## X-squared = 537.56, df = 112, p-value < 2.2e-16
## 
## [1] "device_model"
## Warning in chisq.test(x$churn, x[, i], correct = FALSE): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  x$churn and x[, i]
## X-squared = 3854.2, df = 1719, p-value < 2.2e-16
## 
## [1] "lang"
## 
##  Pearson's Chi-squared test
## 
## data:  x$churn and x[, i]
## X-squared = 789.93, df = 27, p-value < 2.2e-16
## 
## [1] "churn"
## 
##  Pearson's Chi-squared test
## 
## data:  x$churn and x[, i]
## X-squared = 30000, df = 1, p-value < 2.2e-16
## 
## [1] "multi_lang"
## 
##  Pearson's Chi-squared test
## 
## data:  x$churn and x[, i]
## X-squared = 166.16, df = 1, p-value < 2.2e-16

All of our categorical variables have very small p-values, meaning they are strongly associated with the response column. (The churn-vs-churn test is trivially significant and can be ignored, and the warnings for country, device_brand, and device_model indicate the chi-squared approximation may be unreliable because of sparse levels.)
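The "Chi-squared approximation may be incorrect" warnings come from sparse cells (many countries and device models with only a handful of users). One hedged workaround is to collapse rare levels into an "other" bucket before testing; a self-contained sketch (the threshold of 30 and the helper name `collapse_rare` are illustrative):

```r
#collapse factor levels with fewer than min_n observations into "other",
#removing most of the sparse cells behind the approximation warning
collapse_rare <- function(f, min_n = 30) {
  tab <- table(f)
  levels(f)[levels(f) %in% names(tab)[tab < min_n]] <- "other"
  f
}

#toy example; in the report one could run e.g. chisq.test(x$churn, collapse_rare(x$country))
country <- factor(c(rep("US", 40), rep("DE", 35), rep("XX", 3), rep("YY", 2)))
table(collapse_rare(country))
```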

Information Gain

library(FSelectorRcpp)

x_categoricals <- x[, sapply(x, is.factor)]
information_gain(formula = churn~., data = x_categoricals)
##     attributes  importance
## 1      country 0.015812466
## 2 device_brand 0.009781128
## 3 device_model 0.072322337
## 4         lang 0.013286518
## 5   multi_lang 0.003236746

We gain a lot of information from device_model, an acceptable amount from country and lang, and very little from device_brand and almost none from multi_lang.

Numerical Variables

Correlation

To see the correlations of the numeric variables with churn, we re-read churn as an integer and compute the correlation matrix of our numerical features using the corrplot library.

#re-read the raw file so that churn stays an integer (in x it is a factor)
x_churnasint <- read.csv("ml_project_2022_churn_train.csv")
churn_int <- x_churnasint$churn
x_numeric <- x[, sapply(x, is.numeric)]

x_numeric_churnasint <- cbind(x_numeric, churn_int)

library(corrplot)
## corrplot 0.92 loaded
M <- cor(x_numeric_churnasint)
corrplot(M, method="color")

corrplot(M, method="number")

We see that the variables most correlated with churn are X_id, max_lvl_no, and max_ses_length.

session_cnt is more than 80% correlated with repeat_cnt, banner_cnt, and is_cnt.

bonus_cnt is also more than 80% correlated with repeat_cnt.
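Reading correlated pairs off the corrplot can also be done programmatically. A base-R sketch on a toy matrix (in the report one would pass the matrix M computed above instead of M_toy):

```r
#build a toy correlation matrix: a and b are nearly identical, c is independent
set.seed(7)
a <- rnorm(100)
d <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))
M_toy <- cor(d)

#list the pairs whose |correlation| exceeds 0.8 (upper triangle, so no
#diagonal and no duplicated pairs)
high <- which(abs(M_toy) > 0.8 & upper.tri(M_toy), arr.ind = TRUE)
data.frame(var1 = rownames(M_toy)[high[, 1]],
           var2 = colnames(M_toy)[high[, 2]],
           cor  = M_toy[high])
```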

T Test

library(stats)
numeric_names <- names(Filter(is.integer, x))

for (i in numeric_names) {
  print(i)
  print(t.test(x[,i]~x$churn))
}
## [1] "X_id"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = 31.96, df = 24074, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  723446.8 817979.3
## sample estimates:
## mean in group 0 mean in group 1 
##        91574158        90803445 
## 
## [1] "session_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -8.0774, df = 25327, p-value = 6.908e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.045189 -3.074799
## sample estimates:
## mean in group 0 mean in group 1 
##        6.036455       10.096448 
## 
## [1] "session_length"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -4.7324, df = 29640, p-value = 2.229e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -4745.812 -1965.974
## sample estimates:
## mean in group 0 mean in group 1 
##         6978.81        10334.70 
## 
## [1] "max_lvl_no"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = 17.353, df = 14041, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  12.03940 15.10552
## sample estimates:
## mean in group 0 mean in group 1 
##        30.22043        16.64797 
## 
## [1] "gameplay_duration"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -1.5998, df = 20047, p-value = 0.1097
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -7105.7318   719.1563
## sample estimates:
## mean in group 0 mean in group 1 
##        5407.839        8601.127 
## 
## [1] "bonus_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -1.5341, df = 18515, p-value = 0.125
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -19.123202   2.331502
## sample estimates:
## mean in group 0 mean in group 1 
##        34.87840        43.27425 
## 
## [1] "hint1_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -2.5713, df = 22454, p-value = 0.01014
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.8806095 -0.1187855
## sample estimates:
## mean in group 0 mean in group 1 
##       0.1488563       0.6485538 
## 
## [1] "hint2_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -8.5327, df = 29987, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.9752406 -0.6108895
## sample estimates:
## mean in group 0 mean in group 1 
##        1.726501        2.519566 
## 
## [1] "hint3_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -1.6615, df = 14271, p-value = 0.09664
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -9.8285751  0.8105005
## sample estimates:
## mean in group 0 mean in group 1 
##        8.607934       13.116972 
## 
## [1] "repeat_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -3.752, df = 25447, p-value = 0.0001758
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -36.13017 -11.33456
## sample estimates:
## mean in group 0 mean in group 1 
##        56.99115        80.72352 
## 
## [1] "gold_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = 4.8555, df = 17353, p-value = 1.211e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   92.94369 218.78462
## sample estimates:
## mean in group 0 mean in group 1 
##        941.0042        785.1400 
## 
## [1] "banner_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -2.3004, df = 26228, p-value = 0.02144
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -36.395104  -2.907029
## sample estimates:
## mean in group 0 mean in group 1 
##        152.7311        172.3821 
## 
## [1] "is_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -2.7922, df = 28191, p-value = 0.005239
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.522520 -0.966992
## sample estimates:
## mean in group 0 mean in group 1 
##        20.78422        24.02898 
## 
## [1] "rv_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -1.1253, df = 15782, p-value = 0.2605
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.495740  1.487027
## sample estimates:
## mean in group 0 mean in group 1 
##        9.129736       11.134092 
## 
## [1] "max_ses_length"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = -2.0159, df = 23626, p-value = 0.04382
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1447.62949   -20.32813
## sample estimates:
## mean in group 0 mean in group 1 
##        2802.796        3536.775 
## 
## [1] "median_ses_length"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = 34.374, df = 18995, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  517.5379 580.1295
## sample estimates:
## mean in group 0 mean in group 1 
##       1209.9140        661.0802 
## 
## [1] "avg_ses_length"
## 
##  Welch Two Sample t-test
## 
## data:  x[, i] by x$churn
## t = 9.4766, df = 23307, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  339.8406 517.0783
## sample estimates:
## mean in group 0 mean in group 1 
##       1365.7693        937.3099

ANOVA

for (i in numeric_names) {
  print(i)
  print(oneway.test(x[,i]~churn, data = x, var.equal = FALSE))
}
## [1] "X_id"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 1021.5, num df = 1, denom df = 24074, p-value < 2.2e-16
## 
## [1] "session_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 65.244, num df = 1, denom df = 25327, p-value = 6.908e-16
## 
## [1] "session_length"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 22.396, num df = 1, denom df = 29640, p-value = 2.229e-06
## 
## [1] "max_lvl_no"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 301.14, num df = 1, denom df = 14041, p-value < 2.2e-16
## 
## [1] "gameplay_duration"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 2.5593, num df = 1, denom df = 20047, p-value = 0.1097
## 
## [1] "bonus_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 2.3534, num df = 1, denom df = 18515, p-value = 0.125
## 
## [1] "hint1_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 6.6116, num df = 1, denom df = 22454, p-value = 0.01014
## 
## [1] "hint2_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 72.806, num df = 1, denom df = 29987, p-value < 2.2e-16
## 
## [1] "hint3_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 2.7605, num df = 1, denom df = 14271, p-value = 0.09664
## 
## [1] "repeat_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 14.078, num df = 1, denom df = 25447, p-value = 0.0001758
## 
## [1] "gold_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 23.576, num df = 1, denom df = 17353, p-value = 1.211e-06
## 
## [1] "banner_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 5.2916, num df = 1, denom df = 26228, p-value = 0.02144
## 
## [1] "is_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 7.7961, num df = 1, denom df = 28191, p-value = 0.005239
## 
## [1] "rv_cnt"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 1.2662, num df = 1, denom df = 15782, p-value = 0.2605
## 
## [1] "max_ses_length"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 4.0638, num df = 1, denom df = 23626, p-value = 0.04382
## 
## [1] "median_ses_length"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 1181.6, num df = 1, denom df = 18995, p-value < 2.2e-16
## 
## [1] "avg_ses_length"
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  x[, i] and churn
## F = 89.807, num df = 1, denom df = 23307, p-value < 2.2e-16

The p-values of the t-test and ANOVA are identical. This is expected: with only two groups, the Welch one-way test is equivalent to the Welch t-test (F = t²). We can say that variables with a p-value less than 0.05 are significant, and those above 0.05 are not.
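This t-test/ANOVA agreement can be verified on simulated data: for two groups, `oneway.test` with `var.equal = FALSE` reproduces the Welch t-test with F = t² and the same p-value.

```r
set.seed(1)
g <- rep(c(0, 1), each = 50)
y <- rnorm(100) + 0.5 * g

tt <- t.test(y ~ g)                          #Welch two-sample t-test
ft <- oneway.test(y ~ g, var.equal = FALSE)  #Welch one-way ANOVA

all.equal(unname(tt$statistic)^2, unname(ft$statistic))  #F equals t squared
all.equal(tt$p.value, ft$p.value)                        #identical p-values
```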

Non-significant variables:

  1. rv_cnt

  2. hint3_cnt

  3. gameplay_duration

Significant variables:

  1. max_ses_length

  2. banner_cnt

  3. hint1_cnt

Highly significant variables:

  1. avg_ses_length

  2. median_ses_length

  3. gold_cnt

  4. hint2_cnt

  5. max_lvl_no

  6. session_length

  7. session_cnt

  8. X_id

Logistic Regression

We will fit logistic regressions (i) with all numerical variables (except claim_gold_cnt), (ii) with significant and highly significant variables, and (iii) with only highly significant variables, and compare their accuracies and coefficients.

All numeric variables

x_all_numeric <- x[, sapply(x, is.numeric)]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)

x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]

#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_all_numeric_churn_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.6100  -1.1371   0.6929   0.8858   7.0236  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        2.054e+01  7.085e-01  28.996  < 2e-16 ***
## X_id              -2.149e-07  7.748e-09 -27.735  < 2e-16 ***
## session_cnt        2.082e-02  3.412e-03   6.101 1.06e-09 ***
## session_length    -5.641e-06  1.262e-06  -4.469 7.87e-06 ***
## max_lvl_no        -2.323e-02  1.084e-03 -21.440  < 2e-16 ***
## gameplay_duration -1.612e-06  1.239e-06  -1.301 0.193349    
## bonus_cnt         -1.483e-03  2.180e-04  -6.803 1.03e-11 ***
## hint1_cnt         -3.357e-03  1.790e-03  -1.875 0.060727 .  
## hint2_cnt          5.173e-02  4.792e-03  10.793  < 2e-16 ***
## hint3_cnt          4.975e-04  6.673e-04   0.746 0.455914    
## repeat_cnt         7.273e-04  3.592e-04   2.025 0.042910 *  
## gold_cnt           1.686e-04  1.218e-05  13.844  < 2e-16 ***
## banner_cnt        -2.582e-04  1.236e-04  -2.090 0.036658 *  
## is_cnt             5.791e-03  9.917e-04   5.839 5.26e-09 ***
## rv_cnt            -2.259e-03  5.585e-04  -4.045 5.23e-05 ***
## max_ses_length     1.189e-05  3.382e-06   3.515 0.000439 ***
## median_ses_length -4.509e-04  3.109e-05 -14.506  < 2e-16 ***
## avg_ses_length     3.663e-05  2.447e-05   1.497 0.134434    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26437  on 19999  degrees of freedom
## Residual deviance: 23620  on 19982  degrees of freedom
## AIC: 23656
## 
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
##    0    1 
## 2823 7177
act_churn <- x_all_numeric_churn_v$churn

table(lg.pred,act_churn)
##        act_churn
## lg.pred    0    1
##       0 1816 1007
##       1 1901 5276
mean(lg.pred==act_churn)
## [1] 0.7092

We have 70.92% accuracy.
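The 0.55 cutoff used above is a free parameter; one could also sweep candidate thresholds and pick the one that maximizes validation accuracy. A self-contained sketch on toy vectors (in the report, lg.probs and act_churn would come from the chunk above):

```r
#toy predicted probabilities and actual labels
lg.probs  <- c(0.20, 0.60, 0.80, 0.40, 0.95, 0.10)
act_churn <- factor(c("0", "1", "1", "1", "1", "0"))

#accuracy at each candidate threshold
thresholds <- seq(0.30, 0.70, by = 0.05)
acc <- sapply(thresholds, function(t) mean((lg.probs > t) == (act_churn == "1")))
data.frame(threshold = thresholds, accuracy = acc)
```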

Significant numeric variables

x_all_numeric <- x[, sapply(x, is.numeric)]
x_all_numeric <- x_all_numeric[,!names(x_all_numeric) %in% c("rv_cnt", "gameplay_duration", "hint3_cnt")]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)

x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]

#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_all_numeric_churn_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.6704  -1.1376   0.6932   0.8854   7.2544  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        2.051e+01  7.081e-01  28.963  < 2e-16 ***
## X_id              -2.144e-07  7.744e-09 -27.692  < 2e-16 ***
## session_cnt        1.891e-02  3.291e-03   5.746 9.15e-09 ***
## session_length    -5.471e-06  1.146e-06  -4.772 1.82e-06 ***
## max_lvl_no        -2.289e-02  1.075e-03 -21.306  < 2e-16 ***
## bonus_cnt         -1.256e-03  2.031e-04  -6.187 6.14e-10 ***
## hint1_cnt         -3.949e-03  1.742e-03  -2.266 0.023427 *  
## hint2_cnt          5.156e-02  4.820e-03  10.696  < 2e-16 ***
## repeat_cnt         4.932e-04  3.500e-04   1.409 0.158741    
## gold_cnt           1.651e-04  1.256e-05  13.139  < 2e-16 ***
## banner_cnt        -3.768e-04  1.153e-04  -3.269 0.001078 ** 
## is_cnt             5.956e-03  9.883e-04   6.027 1.67e-09 ***
## max_ses_length     1.169e-05  3.252e-06   3.596 0.000323 ***
## median_ses_length -4.553e-04  3.111e-05 -14.637  < 2e-16 ***
## avg_ses_length     3.684e-05  2.460e-05   1.497 0.134270    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26437  on 19999  degrees of freedom
## Residual deviance: 23634  on 19985  degrees of freedom
## AIC: 23664
## 
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
##    0    1 
## 2820 7180
act_churn <- x_all_numeric_churn_v$churn

table(lg.pred,act_churn)
##        act_churn
## lg.pred    0    1
##       0 1811 1009
##       1 1906 5274
mean(lg.pred==act_churn)
## [1] 0.7085

We have 70.85% accuracy.

Highly significant numeric variables

x_all_numeric <- x[, sapply(x, is.numeric)]
#note: "max_session_length" matches no column (the actual name is "max_ses_length"),
#so that feature is not actually dropped, as the summary below confirms
x_all_numeric <- x_all_numeric[,!names(x_all_numeric) %in% c("rv_cnt", "gameplay_duration", "hint3_cnt", "max_session_length", "hint1_cnt", "banner_cnt")]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)

x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]

#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_all_numeric_churn_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.6223  -1.1390   0.6940   0.8861   7.0999  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        2.052e+01  7.078e-01  28.985  < 2e-16 ***
## X_id              -2.144e-07  7.741e-09 -27.694  < 2e-16 ***
## session_cnt        1.477e-02  3.091e-03   4.777 1.78e-06 ***
## session_length    -5.238e-06  1.160e-06  -4.514 6.36e-06 ***
## max_lvl_no        -2.290e-02  1.082e-03 -21.161  < 2e-16 ***
## bonus_cnt         -9.972e-04  2.059e-04  -4.844 1.27e-06 ***
## hint2_cnt          5.122e-02  4.883e-03  10.489  < 2e-16 ***
## repeat_cnt         1.861e-04  3.408e-04   0.546 0.584993    
## gold_cnt           1.649e-04  1.308e-05  12.608  < 2e-16 ***
## is_cnt             4.797e-03  9.210e-04   5.208 1.90e-07 ***
## max_ses_length     1.117e-05  3.291e-06   3.393 0.000691 ***
## median_ses_length -4.680e-04  3.113e-05 -15.034  < 2e-16 ***
## avg_ses_length     3.810e-05  2.496e-05   1.526 0.126887    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26437  on 19999  degrees of freedom
## Residual deviance: 23646  on 19987  degrees of freedom
## AIC: 23672
## 
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
##    0    1 
## 2804 7196
act_churn <- x_all_numeric_churn_v$churn

table(lg.pred,act_churn)
##        act_churn
## lg.pred    0    1
##       0 1796 1008
##       1 1921 5275
mean(lg.pred==act_churn)
## [1] 0.7071

We have 70.71% accuracy.

Now, we will try removing some highly correlated features and see if our accuracy will increase.

We will remove repeat, banner, and is counts.

x_all_numeric <- x[, sapply(x, is.numeric)]
x_all_numeric <- x_all_numeric[,!names(x_all_numeric) %in% c("repeat_cnt", "banner_cnt", "is_cnt")]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)

x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]

#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_all_numeric_churn_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.3031  -1.1436   0.6959   0.8866   6.8192  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        2.029e+01  7.062e-01  28.733  < 2e-16 ***
## X_id              -2.125e-07  7.728e-09 -27.495  < 2e-16 ***
## session_cnt        2.929e-02  2.984e-03   9.818  < 2e-16 ***
## session_length    -5.279e-06  1.460e-06  -3.615 0.000301 ***
## max_lvl_no        -2.085e-02  1.000e-03 -20.849  < 2e-16 ***
## gameplay_duration -8.792e-07  1.084e-06  -0.811 0.417256    
## bonus_cnt         -7.855e-04  1.171e-04  -6.708 1.98e-11 ***
## hint1_cnt         -3.856e-03  2.098e-03  -1.838 0.066061 .  
## hint2_cnt          5.348e-02  4.824e-03  11.087  < 2e-16 ***
## hint3_cnt          4.842e-04  8.398e-04   0.577 0.564224    
## gold_cnt           1.557e-04  1.110e-05  14.024  < 2e-16 ***
## rv_cnt            -1.806e-03  5.505e-04  -3.282 0.001032 ** 
## max_ses_length     1.045e-05  3.669e-06   2.847 0.004414 ** 
## median_ses_length -4.391e-04  3.084e-05 -14.237  < 2e-16 ***
## avg_ses_length     4.206e-05  2.457e-05   1.712 0.086842 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26437  on 19999  degrees of freedom
## Residual deviance: 23670  on 19985  degrees of freedom
## AIC: 23700
## 
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
##    0    1 
## 2780 7220
act_churn <- x_all_numeric_churn_v$churn

table(lg.pred,act_churn)
##        act_churn
## lg.pred    0    1
##       0 1778 1002
##       1 1939 5281
mean(lg.pred==act_churn)
## [1] 0.7059

Our accuracy is 70.59%.

Logistic regression on all numeric variables but one (iterate through all)

for (i in numeric_names) {
x_all_numeric <- x[, sapply(x, is.numeric)]
x_all_numeric <- x_all_numeric[,!names(x_all_numeric) %in% c(i)]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)

x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]

#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
summary(lg)

#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"

act_churn <- x_all_numeric_churn_v$churn

print(i)
print(mean(lg.pred==act_churn))
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "X_id"
## [1] 0.6921
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "session_cnt"
## [1] 0.7058
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "session_length"
## [1] 0.7088
## [1] "max_lvl_no"
## [1] 0.6863
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "gameplay_duration"
## [1] 0.7095
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "bonus_cnt"
## [1] 0.7072
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "hint1_cnt"
## [1] 0.7092
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "hint2_cnt"
## [1] 0.7044
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "hint3_cnt"
## [1] 0.7092
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "repeat_cnt"
## [1] 0.7088
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "gold_cnt"
## [1] 0.709
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "banner_cnt"
## [1] 0.7082
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "is_cnt"
## [1] 0.7066
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "rv_cnt"
## [1] 0.7084
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "max_ses_length"
## [1] 0.7089
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "median_ses_length"
## [1] 0.7067
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "avg_ses_length"
## [1] 0.7092

We see that when we remove “avg_ses_length”, “hint3_cnt”, or “hint1_cnt” -> our accuracy does not change (0.7092)

When we remove “gameplay_duration” -> our accuracy increases slightly (0.7095)

When we remove “X_id” or “max_lvl_no” -> our accuracy decreases by a lot (0.6921 and 0.6863)

Conclusion & Insights from Logistic Regression Experiments

Our accuracy did not improve when we removed the features that the ANOVA and t-test deemed “not significant” (large p-value).

avg_ses_length was of no importance to our logistic regression model even though it had a small p-value.

gameplay_duration had a negative impact on the accuracy of our model.

X_id and max_lvl_no had a significant positive impact on our model.
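A note on methodology: the single 20000/10000 split is sequential, and since the rows appear to be ordered by X_id (which correlates with churn), a k-fold cross-validation would give a more stable accuracy estimate. A sketch on synthetic data (in the report, the same loop would run on x_all_numeric_churn; the fold count and seed are illustrative):

```r
#synthetic stand-in: one informative predictor, one noise predictor
set.seed(42)
n <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$churn <- factor(ifelse(df$x1 + rnorm(n) > 0, "1", "0"))

#randomly assign each row to one of k folds
k <- 5
folds <- sample(rep(1:k, length.out = n))

#train on k-1 folds, score the held-out fold, using the report's 0.55 cutoff
cv_acc <- sapply(1:k, function(f) {
  m <- glm(churn ~ ., data = df[folds != f, ], family = "binomial")
  p <- predict(m, df[folds == f, ], type = "response")
  mean((p > 0.55) == (df$churn[folds == f] == "1"))
})
mean(cv_acc)
```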

2. Feature Engineering

2.1 Additional features

Formulating Skill

max_lvl_no/gameplay_duration -> gives us how far the player got relative to the time he/she spent playing the game.

This might seem reasonable at first, but it does not take into account that gaining levels probably becomes harder as we move forward in the game. If we use this formula as skill, we punish people who get far in the game, and someone who played very little and passed only one or two levels can come across as a highly skilled player.

We can try squaring the level count to offset the penalty that comes from the game getting harder over time:

skill = (max_lvl_no)^2 / gameplay_duration

Another thing we did not take into account is the usage of in-game features that help players gain levels but have nothing to do with skill. If a person uses the bonus or repeat features, watches rewarded ads (which usually boost you in the game), or uses hints heavily, gaining levels becomes much easier. So we should penalize people who use those features when formulating “skill”. We add this to the formula as:

skill = (max_lvl_no)^2 / (gameplay_duration + repeat_cnt + bonus_cnt + rv_cnt + hint1_cnt + hint2_cnt + hint3_cnt)

However, gameplay_duration is measured in much larger units than the other features, so it would dominate the whole formula. Because of this, we should normalize all these features before combining them.

Normalizing the features:

library(BBmisc)
## 
## Attaching package: 'BBmisc'
## The following objects are masked from 'package:sjmisc':
## 
##     %nin%, seq_col, seq_row
## The following object is masked from 'package:base':
## 
##     isFALSE
x_normalized <- x
x_normalized = normalize(x_normalized, method = "range", range = c(0, 1))
#penalized features -> repeat_cnt+bonus_cnt+rv_cnt+hint1_cnt+hint2_cnt+hint3_cnt

library(data.table)
xvdt <- data.table(x_normalized)
#start with the simple (unpenalized) ratio for now
xvdt[, skill := max_lvl_no/gameplay_duration]
str(xvdt)
## Classes 'data.table' and 'data.frame':   30000 obs. of  24 variables:
##  $ X_id             : num  0.000118 0.000391 0.000756 0.001122 0.001394 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : num  0.002616 0 0 0.000436 0 ...
##  $ session_length   : num  0.002579 0.000352 0.000417 0.000252 0.000128 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : num  0.00332 0.00221 0.00747 0.0036 0.00304 ...
##  $ gameplay_duration: num  2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
##  $ bonus_cnt        : num  0.000168 0.000252 0.000224 0.000056 0.000112 ...
##  $ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : num  0.00355 0.00178 0 0 0.00178 ...
##  $ hint3_cnt        : num  0 0 0 0 0 ...
##  $ repeat_cnt       : num  0.00023 0.000328 0.000689 0.00023 0.000328 ...
##  $ gold_cnt         : num  0.001926 0.000115 0.00456 0.006265 0.006297 ...
##  $ banner_cnt       : num  0.000556 0.001242 0 0.001601 0.000752 ...
##  $ is_cnt           : num  0.000114 0 0.001824 0.000228 0 ...
##  $ rv_cnt           : num  0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : num  0.002441 0.000353 0.000418 0.000243 0.000128 ...
##  $ median_ses_length: num  0.00115 0.02083 0.02465 0.00747 0.00767 ...
##  $ avg_ses_length   : num  0.002705 0.002602 0.003083 0.000922 0.000947 ...
##  $ skill            : num  112 42.7 162.5 108 210.7 ...
##  - attr(*, ".internal.selfref")=<externalptr>
quartiles <- quantile(xvdt$skill, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(xvdt$skill)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR

xvdt_skill_no <- subset(xvdt, xvdt$skill > Lower & xvdt$skill < Upper)

dim(xvdt_skill_no)
## [1] 28857    24
library(ggplot2)
ggplot(xvdt_skill_no, aes(skill)) + geom_histogram(bins = 1000) + facet_wrap(~churn)

xvdt_skill_no$churn <- as.numeric(xvdt_skill_no$churn)
str(xvdt_skill_no)
## Classes 'data.table' and 'data.frame':   28857 obs. of  24 variables:
##  $ X_id             : num  0.000118 0.000391 0.000756 0.001122 0.001394 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : num  0.002616 0 0 0.000436 0 ...
##  $ session_length   : num  0.002579 0.000352 0.000417 0.000252 0.000128 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : num  0.00332 0.00221 0.00747 0.0036 0.00304 ...
##  $ gameplay_duration: num  2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
##  $ bonus_cnt        : num  0.000168 0.000252 0.000224 0.000056 0.000112 ...
##  $ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : num  0.00355 0.00178 0 0 0.00178 ...
##  $ hint3_cnt        : num  0 0 0 0 0 ...
##  $ repeat_cnt       : num  0.00023 0.000328 0.000689 0.00023 0.000328 ...
##  $ gold_cnt         : num  0.001926 0.000115 0.00456 0.006265 0.006297 ...
##  $ banner_cnt       : num  0.000556 0.001242 0 0.001601 0.000752 ...
##  $ is_cnt           : num  0.000114 0 0.001824 0.000228 0 ...
##  $ rv_cnt           : num  0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
##  $ churn            : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : num  0.002441 0.000353 0.000418 0.000243 0.000128 ...
##  $ median_ses_length: num  0.00115 0.02083 0.02465 0.00747 0.00767 ...
##  $ avg_ses_length   : num  0.002705 0.002602 0.003083 0.000922 0.000947 ...
##  $ skill            : num  112 42.7 162.5 108 210.7 ...
##  - attr(*, ".internal.selfref")=<externalptr>
dim(xvdt_skill_no)
## [1] 28857    24
print(cor(xvdt_skill_no$skill, xvdt_skill_no$churn))
## [1] -0.2346475

When we create a new column skill = max_lvl_no/gameplay_duration, we see the distribution shown in the plot, and the correlation of the new feature with the response column is about -0.23. That is high compared to our other features, and it suggests that users churn less as they become more skilled.
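As an illustration of this screening step, the same cor() check can be reproduced on synthetic data; every value below is simulated and only the column names mirror the project's, so this is a sketch rather than the real data set.

```r
# Simulated stand-in for the project data: an engineered "levels per second"
# ratio feature, and a churn probability that decreases with it.
set.seed(42)
n <- 1000
max_lvl_no <- rpois(n, 12)                               # levels reached
gameplay_duration <- 60 * max_lvl_no + rexp(n, 1 / 300)  # seconds played
skill <- max_lvl_no / gameplay_duration
churn <- rbinom(n, 1, plogis(0.5 - 2 * scale(skill)[, 1]))
cor(skill, churn)  # negative, as for the real feature
```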

library(BBmisc)
x_normalized <- x
x_normalized = normalize(x_normalized, method = "range", range = c(0, 1))
#penalized features -> repeat_cnt+bonus_cnt+rv_cnt+hint1_cnt+hint2_cnt+hint3_cnt

library(data.table)
xvdt <- data.table(x_normalized)
xvdt[, skill := max_lvl_no^2/gameplay_duration]
str(xvdt)
## Classes 'data.table' and 'data.frame':   30000 obs. of  24 variables:
##  $ X_id             : num  0.000118 0.000391 0.000756 0.001122 0.001394 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : num  0.002616 0 0 0.000436 0 ...
##  $ session_length   : num  0.002579 0.000352 0.000417 0.000252 0.000128 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : num  0.00332 0.00221 0.00747 0.0036 0.00304 ...
##  $ gameplay_duration: num  2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
##  $ bonus_cnt        : num  0.000168 0.000252 0.000224 0.000056 0.000112 ...
##  $ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : num  0.00355 0.00178 0 0 0.00178 ...
##  $ hint3_cnt        : num  0 0 0 0 0 ...
##  $ repeat_cnt       : num  0.00023 0.000328 0.000689 0.00023 0.000328 ...
##  $ gold_cnt         : num  0.001926 0.000115 0.00456 0.006265 0.006297 ...
##  $ banner_cnt       : num  0.000556 0.001242 0 0.001601 0.000752 ...
##  $ is_cnt           : num  0.000114 0 0.001824 0.000228 0 ...
##  $ rv_cnt           : num  0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : num  0.002441 0.000353 0.000418 0.000243 0.000128 ...
##  $ median_ses_length: num  0.00115 0.02083 0.02465 0.00747 0.00767 ...
##  $ avg_ses_length   : num  0.002705 0.002602 0.003083 0.000922 0.000947 ...
##  $ skill            : num  0.3717 0.0945 1.2133 0.3885 0.6412 ...
##  - attr(*, ".internal.selfref")=<externalptr>
quartiles <- quantile(xvdt$skill, probs = c(.25, .75), na.rm = FALSE)
IQR <- IQR(xvdt$skill)

Lower <- quartiles[1] - 1.5 * IQR
Upper <- quartiles[2] + 1.5 * IQR

xvdt_skill_no <- subset(xvdt, xvdt$skill > Lower & xvdt$skill < Upper)

dim(xvdt_skill_no)
## [1] 28018    24
library(ggplot2)
ggplot(xvdt_skill_no, aes(skill)) + geom_histogram(bins = 1000) + facet_wrap(xvdt_skill_no$churn)

xvdt_skill_no$churn <- as.numeric(xvdt_skill_no$churn)
str(xvdt_skill_no)
## Classes 'data.table' and 'data.frame':   28018 obs. of  24 variables:
##  $ X_id             : num  0.000118 0.000391 0.000756 0.001122 0.001394 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : num  0.002616 0 0 0.000436 0 ...
##  $ session_length   : num  0.002579 0.000352 0.000417 0.000252 0.000128 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : num  0.00332 0.00221 0.00747 0.0036 0.00304 ...
##  $ gameplay_duration: num  2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
##  $ bonus_cnt        : num  0.000168 0.000252 0.000224 0.000056 0.000112 ...
##  $ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : num  0.00355 0.00178 0 0 0.00178 ...
##  $ hint3_cnt        : num  0 0 0 0 0 ...
##  $ repeat_cnt       : num  0.00023 0.000328 0.000689 0.00023 0.000328 ...
##  $ gold_cnt         : num  0.001926 0.000115 0.00456 0.006265 0.006297 ...
##  $ banner_cnt       : num  0.000556 0.001242 0 0.001601 0.000752 ...
##  $ is_cnt           : num  0.000114 0 0.001824 0.000228 0 ...
##  $ rv_cnt           : num  0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
##  $ churn            : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : num  0.002441 0.000353 0.000418 0.000243 0.000128 ...
##  $ median_ses_length: num  0.00115 0.02083 0.02465 0.00747 0.00767 ...
##  $ avg_ses_length   : num  0.002705 0.002602 0.003083 0.000922 0.000947 ...
##  $ skill            : num  0.3717 0.0945 1.2133 0.3885 0.6412 ...
##  - attr(*, ".internal.selfref")=<externalptr>
print(cor(xvdt_skill_no$skill, xvdt_skill_no$churn))
## [1] -0.3210445
dim(xvdt_skill_no)
## [1] 28018    24

We see that when we take the game becoming harder over time into account by squaring max_lvl_no, as discussed before, the correlation is even stronger at about -0.32, although we need to remove roughly 2,000 rows that are outliers for the skill feature. This is a sacrifice worth making for a feature that is 32% correlated with our response column, because no other feature comes close to that number.
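The outlier rule used here (and repeated for each skill variant) can be wrapped in a small reusable helper; the function name and toy data below are my own illustration, not project code.

```r
# 1.5*IQR filter: keep rows whose `col` value lies inside the Tukey fences.
remove_iqr_outliers <- function(df, col) {
  q <- quantile(df[[col]], probs = c(.25, .75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  df[df[[col]] > q[1] - 1.5 * iqr & df[[col]] < q[2] + 1.5 * iqr, , drop = FALSE]
}

toy <- data.frame(skill = c(1:20, 100))
nrow(remove_iqr_outliers(toy, "skill"))  # 20: only the extreme value 100 is dropped
```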

library(BBmisc)
x_normalized <- x
x_normalized = normalize(x_normalized, method = "range", range = c(0, 1))
#penalized features -> repeat_cnt+bonus_cnt+rv_cnt+hint1_cnt+hint2_cnt+hint3_cnt

library(data.table)
xvdt <- data.table(x_normalized)
xvdt[, skill := max_lvl_no^2/gameplay_duration+repeat_cnt+bonus_cnt+rv_cnt+hint1_cnt+hint2_cnt+hint3_cnt]
str(xvdt)
## Classes 'data.table' and 'data.frame':   30000 obs. of  24 variables:
##  $ X_id             : num  0.000118 0.000391 0.000756 0.001122 0.001394 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : num  0.002616 0 0 0.000436 0 ...
##  $ session_length   : num  0.002579 0.000352 0.000417 0.000252 0.000128 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : num  0.00332 0.00221 0.00747 0.0036 0.00304 ...
##  $ gameplay_duration: num  2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
##  $ bonus_cnt        : num  0.000168 0.000252 0.000224 0.000056 0.000112 ...
##  $ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : num  0.00355 0.00178 0 0 0.00178 ...
##  $ hint3_cnt        : num  0 0 0 0 0 ...
##  $ repeat_cnt       : num  0.00023 0.000328 0.000689 0.00023 0.000328 ...
##  $ gold_cnt         : num  0.001926 0.000115 0.00456 0.006265 0.006297 ...
##  $ banner_cnt       : num  0.000556 0.001242 0 0.001601 0.000752 ...
##  $ is_cnt           : num  0.000114 0 0.001824 0.000228 0 ...
##  $ rv_cnt           : num  0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : num  0.002441 0.000353 0.000418 0.000243 0.000128 ...
##  $ median_ses_length: num  0.00115 0.02083 0.02465 0.00747 0.00767 ...
##  $ avg_ses_length   : num  0.002705 0.002602 0.003083 0.000922 0.000947 ...
##  $ skill            : num  0.376 0.097 1.214 0.389 0.643 ...
##  - attr(*, ".internal.selfref")=<externalptr>
quartiles <- quantile(xvdt$skill, probs = c(.25, .75), na.rm = FALSE)
IQR <- IQR(xvdt$skill)

Lower <- quartiles[1] - 1.5 * IQR
Upper <- quartiles[2] + 1.5 * IQR

xvdt_skill_no <- subset(xvdt, xvdt$skill > Lower & xvdt$skill < Upper)

dim(xvdt_skill_no)
## [1] 27984    24
library(ggplot2)
ggplot(xvdt_skill_no, aes(skill)) + geom_histogram(bins = 1000) + facet_wrap(xvdt_skill_no$churn)

xvdt_skill_no$churn <- as.numeric(xvdt_skill_no$churn)
str(xvdt_skill_no)
## Classes 'data.table' and 'data.frame':   27984 obs. of  24 variables:
##  $ X_id             : num  0.000118 0.000391 0.000756 0.001122 0.001394 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : num  0.002616 0 0 0.000436 0 ...
##  $ session_length   : num  0.002579 0.000352 0.000417 0.000252 0.000128 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : num  0.00332 0.00221 0.00747 0.0036 0.00304 ...
##  $ gameplay_duration: num  2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
##  $ bonus_cnt        : num  0.000168 0.000252 0.000224 0.000056 0.000112 ...
##  $ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : num  0.00355 0.00178 0 0 0.00178 ...
##  $ hint3_cnt        : num  0 0 0 0 0 ...
##  $ repeat_cnt       : num  0.00023 0.000328 0.000689 0.00023 0.000328 ...
##  $ gold_cnt         : num  0.001926 0.000115 0.00456 0.006265 0.006297 ...
##  $ banner_cnt       : num  0.000556 0.001242 0 0.001601 0.000752 ...
##  $ is_cnt           : num  0.000114 0 0.001824 0.000228 0 ...
##  $ rv_cnt           : num  0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
##  $ churn            : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : num  0.002441 0.000353 0.000418 0.000243 0.000128 ...
##  $ median_ses_length: num  0.00115 0.02083 0.02465 0.00747 0.00767 ...
##  $ avg_ses_length   : num  0.002705 0.002602 0.003083 0.000922 0.000947 ...
##  $ skill            : num  0.376 0.097 1.214 0.389 0.643 ...
##  - attr(*, ".internal.selfref")=<externalptr>
print(cor(xvdt_skill_no$skill, xvdt_skill_no$churn))
## [1] -0.3200455
dim(xvdt_skill_no)
## [1] 27984    24

However, when we add the penalizing features that boost a player's level progress into the equation, the correlation does not improve. We will therefore not use them, and will define our new skill column as max_lvl_no^2/gameplay_duration.

Experimenting with logistic regression with the skill feature added

Add new feature:

#normalize
library(BBmisc)
x_normalized <- x
x_normalized = normalize(x_normalized, method = "range", range = c(0, 1))

#add new column
library(data.table)
xvdt <- data.table(x_normalized)
xvdt[, skill := max_lvl_no^2/gameplay_duration]
str(xvdt)
## Classes 'data.table' and 'data.frame':   30000 obs. of  24 variables:
##  $ X_id             : num  0.000118 0.000391 0.000756 0.001122 0.001394 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : num  0.002616 0 0 0.000436 0 ...
##  $ session_length   : num  0.002579 0.000352 0.000417 0.000252 0.000128 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : num  0.00332 0.00221 0.00747 0.0036 0.00304 ...
##  $ gameplay_duration: num  2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
##  $ bonus_cnt        : num  0.000168 0.000252 0.000224 0.000056 0.000112 ...
##  $ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : num  0.00355 0.00178 0 0 0.00178 ...
##  $ hint3_cnt        : num  0 0 0 0 0 ...
##  $ repeat_cnt       : num  0.00023 0.000328 0.000689 0.00023 0.000328 ...
##  $ gold_cnt         : num  0.001926 0.000115 0.00456 0.006265 0.006297 ...
##  $ banner_cnt       : num  0.000556 0.001242 0 0.001601 0.000752 ...
##  $ is_cnt           : num  0.000114 0 0.001824 0.000228 0 ...
##  $ rv_cnt           : num  0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : num  0.002441 0.000353 0.000418 0.000243 0.000128 ...
##  $ median_ses_length: num  0.00115 0.02083 0.02465 0.00747 0.00767 ...
##  $ avg_ses_length   : num  0.002705 0.002602 0.003083 0.000922 0.000947 ...
##  $ skill            : num  0.3717 0.0945 1.2133 0.3885 0.6412 ...
##  - attr(*, ".internal.selfref")=<externalptr>
#remove outliers from skill
quartiles <- quantile(xvdt$skill, probs = c(.25, .75), na.rm = FALSE)
IQR <- IQR(xvdt$skill)

Lower <- quartiles[1] - 1.5 * IQR
Upper <- quartiles[2] + 1.5 * IQR

xvdt_skill_no <- subset(xvdt, xvdt$skill > Lower & xvdt$skill < Upper)

#check the work
str(xvdt_skill_no)
## Classes 'data.table' and 'data.frame':   28018 obs. of  24 variables:
##  $ X_id             : num  0.000118 0.000391 0.000756 0.001122 0.001394 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : num  0.002616 0 0 0.000436 0 ...
##  $ session_length   : num  0.002579 0.000352 0.000417 0.000252 0.000128 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : num  0.00332 0.00221 0.00747 0.0036 0.00304 ...
##  $ gameplay_duration: num  2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
##  $ bonus_cnt        : num  0.000168 0.000252 0.000224 0.000056 0.000112 ...
##  $ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : num  0.00355 0.00178 0 0 0.00178 ...
##  $ hint3_cnt        : num  0 0 0 0 0 ...
##  $ repeat_cnt       : num  0.00023 0.000328 0.000689 0.00023 0.000328 ...
##  $ gold_cnt         : num  0.001926 0.000115 0.00456 0.006265 0.006297 ...
##  $ banner_cnt       : num  0.000556 0.001242 0 0.001601 0.000752 ...
##  $ is_cnt           : num  0.000114 0 0.001824 0.000228 0 ...
##  $ rv_cnt           : num  0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : num  0.002441 0.000353 0.000418 0.000243 0.000128 ...
##  $ median_ses_length: num  0.00115 0.02083 0.02465 0.00747 0.00767 ...
##  $ avg_ses_length   : num  0.002705 0.002602 0.003083 0.000922 0.000947 ...
##  $ skill            : num  0.3717 0.0945 1.2133 0.3885 0.6412 ...
##  - attr(*, ".internal.selfref")=<externalptr>
dim(xvdt_skill_no)
## [1] 28018    24

Fit a logistic regression with only the numerical variables and calculate its accuracy:

xvdt_skill_no <- as.data.frame(xvdt_skill_no)
xvdt_skill_no_all_numeric <- xvdt_skill_no[, sapply(xvdt_skill_no, is.numeric)]
churn <- xvdt_skill_no$churn
xvdt_skill_no_all_numeric_churn <- cbind(xvdt_skill_no_all_numeric, churn)

xvdt_skill_no_all_numeric_churn_t = xvdt_skill_no_all_numeric_churn[1:20000, ]
xvdt_skill_no_all_numeric_churn_v = xvdt_skill_no_all_numeric_churn[20001:28000, ]

#model and coefficients
lg <- glm(churn~., data = xvdt_skill_no_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = xvdt_skill_no_all_numeric_churn_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7880  -1.0229   0.6018   0.8350   5.2618  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.08846    0.04365  47.842  < 2e-16 ***
## X_id               -1.53990    0.05412 -28.452  < 2e-16 ***
## session_cnt        15.81115    6.09857   2.593  0.00953 ** 
## session_length    -12.97120    5.10267  -2.542  0.01102 *  
## max_lvl_no        -46.51730    6.38531  -7.285 3.22e-13 ***
## gameplay_duration -92.49887   48.30935  -1.915  0.05553 .  
## bonus_cnt         -19.89776    4.82958  -4.120 3.79e-05 ***
## hint1_cnt          -6.27388    4.52125  -1.388  0.16525    
## hint2_cnt          20.25634    2.90030   6.984 2.86e-12 ***
## hint3_cnt           6.46183   11.15791   0.579  0.56250    
## repeat_cnt          1.80539    5.38264   0.335  0.73732    
## gold_cnt           48.53020    7.13909   6.798 1.06e-11 ***
## banner_cnt         -6.96282    2.91152  -2.391  0.01678 *  
## is_cnt             39.34391    7.07149   5.564 2.64e-08 ***
## rv_cnt             -7.20974    6.41170  -1.124  0.26082    
## max_ses_length     46.28829   26.20220   1.767  0.07730 .  
## median_ses_length -33.42723    2.94380 -11.355  < 2e-16 ***
## avg_ses_length     13.35168   20.55285   0.650  0.51593    
## skill              -1.27390    0.05013 -25.414  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26006  on 19999  degrees of freedom
## Residual deviance: 22305  on 19981  degrees of freedom
## AIC: 22343
## 
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, xvdt_skill_no_all_numeric_churn_v, type="response")
lg.pred = rep("0", 8000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
##    0    1 
## 2201 5799
act_churn <- xvdt_skill_no_all_numeric_churn_v$churn

table(lg.pred,act_churn)
##        act_churn
## lg.pred    0    1
##       0 1397  804
##       1 1402 4397
mean(lg.pred==act_churn)
## [1] 0.72425

As we can see, our prediction accuracy improved to 72.4% and the p-value of skill is very low. Next, to see how much this single feature explains on its own, we fit a logistic regression with only skill:
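The 0.55 cutoff above is one choice; a cutoff can also be tuned by sweeping candidate thresholds on held-out data and keeping the most accurate one. The sketch below uses simulated probabilities and labels in place of lg.probs and act_churn.

```r
set.seed(7)
actual <- rbinom(500, 1, 0.7)                                    # simulated labels
probs <- plogis(qlogis(0.7) + 1.5 * (actual - 0.5) + rnorm(500)) # simulated scores
cutoffs <- seq(0.30, 0.70, by = 0.05)
acc <- sapply(cutoffs, function(t) mean(as.integer(probs > t) == actual))
cutoffs[which.max(acc)]  # threshold with the best simulated accuracy
```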

xvdt_skill_no <- as.data.frame(xvdt_skill_no)
xvdt_skill_no_all_numeric <- xvdt_skill_no[, sapply(xvdt_skill_no, is.numeric)]
churn <- xvdt_skill_no$churn
xvdt_skill_no_all_numeric_churn <- cbind(xvdt_skill_no_all_numeric, churn)

xvdt_skill_no_all_numeric_churn_t = xvdt_skill_no_all_numeric_churn[1:20000, ]
xvdt_skill_no_all_numeric_churn_v = xvdt_skill_no_all_numeric_churn[20001:28000, ]

#model and coefficients
lg <- glm(churn~skill, data = xvdt_skill_no_all_numeric_churn_t, family = "binomial")
summary(lg)
## 
## Call:
## glm(formula = churn ~ skill, family = "binomial", data = xvdt_skill_no_all_numeric_churn_t)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7910  -1.1685   0.6867   0.8428   1.8962  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.37938    0.02397   57.55   <2e-16 ***
## skill       -1.53701    0.03503  -43.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26006  on 19999  degrees of freedom
## Residual deviance: 23828  on 19998  degrees of freedom
## AIC: 23832
## 
## Number of Fisher Scoring iterations: 4
#prediction and accuracy
lg.probs <- predict(lg, xvdt_skill_no_all_numeric_churn_v, type="response")
lg.pred = rep("0", 8000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
##    0    1 
## 1851 6149
act_churn <- xvdt_skill_no_all_numeric_churn_v$churn

mean(lg.pred==act_churn)
## [1] 0.674

We see that when we fit our model with only the skill feature, we get 67.4% accuracy, which is quite good for a single column, so we add it as a feature to our data.

skill <- xvdt$skill
x_skill <- cbind(x, skill)

quartiles <- quantile(x_skill$skill, probs = c(.25, .75), na.rm = FALSE)
IQR <- IQR(x_skill$skill)

Lower <- quartiles[1] - 1.5 * IQR
Upper <- quartiles[2] + 1.5 * IQR

x_skill <- subset(x_skill, x_skill$skill > Lower & x_skill$skill < Upper)

str(x_skill)
## 'data.frame':    28018 obs. of  24 variables:
##  $ X_id             : int  88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : int  7 1 1 2 1 3 1 3 1 8 ...
##  $ session_length   : int  13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : int  13 9 28 14 12 20 9 23 9 1 ...
##  $ gameplay_duration: int  1088 1898 1685 1221 532 3018 393 747 382 1435 ...
##  $ bonus_cnt        : int  6 9 8 2 4 23 3 3 0 15 ...
##  $ hint1_cnt        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : int  2 1 0 0 1 0 0 0 0 0 ...
##  $ hint3_cnt        : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ repeat_cnt       : int  7 10 21 7 10 30 4 1 1 38 ...
##  $ gold_cnt         : int  418 25 990 1360 1367 585 620 2141 665 500 ...
##  $ banner_cnt       : int  17 38 0 49 23 154 16 33 9 14 ...
##  $ is_cnt           : int  1 0 16 2 0 7 0 11 0 2 ...
##  $ rv_cnt           : int  0 1 1 1 1 3 0 1 1 4 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : int  12451 1809 2141 1248 666 1817 503 449 520 605 ...
##  $ median_ses_length: int  100 1809 2141 649 666 1609 503 372 520 193 ...
##  $ avg_ses_length   : int  1880 1809 2141 649 666 1171 503 379 520 236 ...
##  $ skill            : num  0.3717 0.0945 1.2133 0.3885 0.6412 ...
str(x)
## 'data.frame':    30000 obs. of  23 variables:
##  $ X_id             : int  88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
##  $ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
##  $ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
##  $ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
##  $ session_cnt      : int  7 1 1 2 1 3 1 3 1 8 ...
##  $ session_length   : int  13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
##  $ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
##  $ max_lvl_no       : int  13 9 28 14 12 20 9 23 9 1 ...
##  $ gameplay_duration: int  1088 1898 1685 1221 532 3018 393 747 382 1435 ...
##  $ bonus_cnt        : int  6 9 8 2 4 23 3 3 0 15 ...
##  $ hint1_cnt        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : int  2 1 0 0 1 0 0 0 0 0 ...
##  $ hint3_cnt        : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ repeat_cnt       : int  7 10 21 7 10 30 4 1 1 38 ...
##  $ gold_cnt         : int  418 25 990 1360 1367 585 620 2141 665 500 ...
##  $ banner_cnt       : int  17 38 0 49 23 154 16 33 9 14 ...
##  $ is_cnt           : int  1 0 16 2 0 7 0 11 0 2 ...
##  $ rv_cnt           : int  0 1 1 1 1 3 0 1 1 4 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : int  12451 1809 2141 1248 666 1817 503 449 520 605 ...
##  $ median_ses_length: int  100 1809 2141 649 666 1609 503 372 520 193 ...
##  $ avg_ses_length   : int  1880 1809 2141 649 666 1171 503 379 520 236 ...

2.1 Reducing cardinality of categorical features

Many of our categorical features have high cardinality (many unique factor levels), which many models struggle to learn from. For example, if we try to fit a decision tree with all our features, we get an error saying that only factors with fewer than 32 levels are supported (I ran into this but do not include the code here).

We can reduce the cardinality of these features by taking all the levels that occur only a few times in the data and combining them into a single “other” category.
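The relabeling applied to each column below follows one pattern, which could be factored into a helper like this (the function name and toy factor are mine, not from the project):

```r
# Collapse factor levels seen fewer than `min_count` times into "Other".
collapse_rare_levels <- function(f, min_count) {
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  factor(ifelse(f %in% rare, "Other", as.character(f)))
}

f <- factor(c(rep("a", 5), rep("b", 4), "c", "d"))
table(collapse_rare_levels(f, min_count = 3))  # a: 5, b: 4, Other: 2
```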

Country

x_skill_reduce <- x_skill

country_column <- x_skill_reduce$country

# get the frequency of each level using the table() function
level_counts <- table(country_column)

# create a vector of levels that occur less than 300 times
levels_to_relevel <- names(level_counts)[level_counts < 300]

# create a new factor column, collapsing the rare levels into "Other"
updated_country_column <- factor(ifelse(country_column %in% levels_to_relevel, "Other", as.character(country_column)))

# view the updated levels of the factor column
table(updated_country_column)
## updated_country_column
##    BG    BR    CZ    FR    GR    HU    IT    LT    MX    NL Other    PL    RO 
##  2173  1871  1399  2022  2726  1310  2911  1083   541   770  2072  1730  3431 
##    SE    SK    TR 
##   823  1035  2121

As we can see, updated_country_column is exactly what we want: the 15 frequent countries keep their own levels, while the remaining countries are combined into the “Other” category.

Device_Brand

x_skill_reduce <- x_skill

device_brand_column <- x_skill_reduce$device_brand

# get the frequency of each level using the table() function
level_counts <- table(device_brand_column)

# create a vector of levels that occur less than 300 times
levels_to_relevel <- names(level_counts)[level_counts < 300]

# create a new factor column, collapsing the rare levels into "Other"
updated_device_brand_column <- factor(ifelse(device_brand_column %in% levels_to_relevel, "Other", as.character(device_brand_column)))

# view the updated levels of the factor column
table(updated_device_brand_column)
## updated_device_brand_column
##   HUAWEI motorola     OPPO    Other  samsung   Xiaomi 
##     5270     1159      821     2277    12617     5874

Device_Model

x_skill_reduce <- x_skill

device_model_column <- x_skill_reduce$device_model

# get the frequency of each level using the table() function
level_counts <- table(device_model_column)

# create a vector of levels that occur less than 215 times
levels_to_relevel <- names(level_counts)[level_counts < 215]

# create a new factor column, collapsing the rare levels into "Other"
updated_device_model_column <- factor(ifelse(device_model_column %in% levels_to_relevel, "Other", as.character(device_model_column)))

# view the updated levels of the factor column
table(updated_device_model_column)
## updated_device_model_column
##                           ANE-LX1       M2003J15SC        M2004J19C 
##              280              381              530              367 
##         MAR-LX1A            Other          POT-LX1     Redmi Note 7 
##              742            17107              386              220 
##     Redmi Note 8 Redmi Note 8 Pro    Redmi Note 8T Redmi Note 9 Pro 
##              229              433              291              501 
##        SM-A105FN         SM-A125F         SM-A127F         SM-A202F 
##              323              317              219              501 
##         SM-A217F        SM-A405FN        SM-A505FN         SM-A515F 
##              529              270              432              780 
##         SM-A528B        SM-A705FN         SM-A715F         SM-G950F 
##              219              267              515              253 
##         SM-G960F         SM-G965F         SM-G973F         SM-G975F 
##              266              224              334              217 
##         SM-G991B          SNE-LX1          VOG-L29 
##              254              266              365
nlevels(updated_device_model_column)
## [1] 31

Language

x_skill_reduce <- x_skill

lang_column <- x_skill_reduce$lang

# get the frequency of each level using the table() function
level_counts <- table(lang_column)

# create a vector of levels that occur less than 215 times
levels_to_relevel <- names(level_counts)[level_counts < 215]

# create a new factor column, collapsing the rare levels into "Other"
updated_lang_column <- factor(ifelse(lang_column %in% levels_to_relevel, "Other", as.character(lang_column)))

# view the updated levels of the factor column
table(updated_lang_column)
## updated_lang_column
##    BR    BU    CS    ES    FR    GR    HU    IT    LT    NL Other    PL    RO 
##  1858  2198  1391  1079  2025  2725  1374  2892  1112   815  1105  1754  3481 
##    RU    SE    SK    TR 
##   262   802  1020  2125
nlevels(updated_lang_column)
## [1] 17

x_skill_reduced -> the dataset in which every factor has fewer than 32 levels.

We can now replace the old categorical columns in our data with the new columns we created:

x_skill_reduced <- x_skill

x_skill_reduced$country <- updated_country_column
x_skill_reduced$device_brand <- updated_device_brand_column
x_skill_reduced$device_model <- updated_device_model_column
x_skill_reduced$lang <- updated_lang_column

str(x_skill_reduced)
## 'data.frame':    28018 obs. of  24 variables:
##  $ X_id             : int  88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
##  $ country          : Factor w/ 16 levels "BG","BR","CZ",..: 7 2 13 8 13 2 4 13 5 10 ...
##  $ device_brand     : Factor w/ 6 levels "HUAWEI","motorola",..: 5 6 1 1 1 6 6 5 5 5 ...
##  $ device_model     : Factor w/ 31 levels "","ANE-LX1","M2003J15SC",..: 14 6 6 30 6 6 9 6 6 20 ...
##  $ session_cnt      : int  7 1 1 2 1 3 1 3 1 8 ...
##  $ session_length   : int  13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
##  $ lang             : Factor w/ 17 levels "BR","BU","CS",..: 8 1 13 9 13 1 5 13 6 10 ...
##  $ max_lvl_no       : int  13 9 28 14 12 20 9 23 9 1 ...
##  $ gameplay_duration: int  1088 1898 1685 1221 532 3018 393 747 382 1435 ...
##  $ bonus_cnt        : int  6 9 8 2 4 23 3 3 0 15 ...
##  $ hint1_cnt        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : int  2 1 0 0 1 0 0 0 0 0 ...
##  $ hint3_cnt        : int  0 0 0 0 0 0 0 0 0 2 ...
##  $ repeat_cnt       : int  7 10 21 7 10 30 4 1 1 38 ...
##  $ gold_cnt         : int  418 25 990 1360 1367 585 620 2141 665 500 ...
##  $ banner_cnt       : int  17 38 0 49 23 154 16 33 9 14 ...
##  $ is_cnt           : int  1 0 16 2 0 7 0 11 0 2 ...
##  $ rv_cnt           : int  0 1 1 1 1 3 0 1 1 4 ...
##  $ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_ses_length   : int  12451 1809 2141 1248 666 1817 503 449 520 605 ...
##  $ median_ses_length: int  100 1809 2141 649 666 1609 503 372 520 193 ...
##  $ avg_ses_length   : int  1880 1809 2141 649 666 1171 503 379 520 236 ...
##  $ skill            : num  0.3717 0.0945 1.2133 0.3885 0.6412 ...

3. Models

Regarding inference, the report also includes F1 scores and AUC/AUCPR metrics for the models I fit, but I weight accuracy most heavily since it is the final objective.

3.1 Choosing Models

I used logistic regression to gain insight on numerical features in previous sections.

Because we don’t have a dimensionality issue (30k rows with ~25 columns), I will not use any dimension reduction methods.

I will also not use any unsupervised learning methods because our aim is to do predictions on a binary classification problem.

Because we have a binary classification problem with limited, non-linear data and categorical features with high cardinality, I will use advanced tree-based methods like Random Forest and Gradient Boosting, and packages like XGBoost and AutoML, to do predictions on the test set.

I also tried many models outside this report, but none of them came close to the advanced tree-based methods, so I am not including them in the report.

3.2 Fitting & Evaluating Models

I created a decision tree purely for visualization and insight purposes, but it did not reveal much.

library(tree)
tree_model <- tree(churn~., data = x_skill_reduced)
plot(tree_model) 
text(tree_model)

Random Forest

Fit model:

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
x_skill_reduced_train <- x_skill_reduced[1:20000,]
x_skill_reduced_test <- x_skill_reduced[20001:28000,]

rf.fit <- randomForest(churn~.,data=x_skill_reduced_train, importance = TRUE, ntree = 500, mtry = 5)
rf.fit
## 
## Call:
##  randomForest(formula = churn ~ ., data = x_skill_reduced_train,      importance = TRUE, ntree = 500, mtry = 5) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 19.18%
## Confusion matrix:
##      0     1 class.error
## 0 4278  2811  0.39652983
## 1 1025 11886  0.07938967

Predict:

predictions <- predict(rf.fit, x_skill_reduced_test)

library(caret)
## Loading required package: lattice
accuracy_table <- confusionMatrix(predictions, x_skill_reduced_test$churn)
accuracy_table
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1632  409
##          1 1167 4792
##                                           
##                Accuracy : 0.803           
##                  95% CI : (0.7941, 0.8117)
##     No Information Rate : 0.6501          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5381          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5831          
##             Specificity : 0.9214          
##          Pos Pred Value : 0.7996          
##          Neg Pred Value : 0.8042          
##              Prevalence : 0.3499          
##          Detection Rate : 0.2040          
##    Detection Prevalence : 0.2551          
##       Balanced Accuracy : 0.7522          
##                                           
##        'Positive' Class : 0               
## 
importance(rf.fit)
##                            0           1 MeanDecreaseAccuracy MeanDecreaseGini
## X_id              108.937222 201.7210424           201.385506      1331.238112
## country            53.735106 -18.5461787            23.898107       386.832844
## device_brand        9.272208  -0.8258751             5.620608       183.759528
## device_model       31.616220   0.6672397            21.125656       618.084458
## session_cnt        47.984465  38.7050698            53.143727       493.672527
## session_length     18.477794  38.7461056            47.258828       350.929038
## lang               54.259147 -14.0314779            26.067932       406.604438
## max_lvl_no         33.866902  34.5110387            51.574454       507.842320
## gameplay_duration  19.033711  44.0672055            54.550536       356.411473
## bonus_cnt           4.774360  37.5327159            40.604553       235.960713
## hint1_cnt           4.673181  -4.8181425            -1.831629         7.687847
## hint2_cnt          11.220287  15.6530185            20.385167       133.705424
## hint3_cnt          65.350676  52.4189577            82.525600       652.255514
## repeat_cnt          5.926775  39.1738887            43.334983       268.899548
## gold_cnt            5.175028  32.0826281            32.786893       351.441040
## banner_cnt         11.515037  39.2345321            45.871689       313.970802
## is_cnt             19.169001  24.3696516            36.399276       221.374463
## rv_cnt             -2.421731  34.0238136            31.976261       185.751573
## multi_lang         19.370295   5.6285204            18.220602        20.296478
## max_ses_length     15.164519  39.9997141            43.242556       337.411974
## median_ses_length  30.441889  31.2255973            44.852951       511.919290
## avg_ses_length     27.805952  33.3386544            48.980779       431.217184
## skill              36.678082  44.5458276            65.968913       845.824077

We got 80% accuracy on our test set.
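The reported accuracy can be checked by hand from the confusion matrix above, where the correct predictions sit on the diagonal:

```r
# Correct predictions from the confusion matrix: 1632 + 4792 out of 8000.
(1632 + 4792) / 8000
## [1] 0.803
```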

When we look at the variable importances, the ID column is by far the most important, followed by hint3_cnt and skill. Device brand and hint1_cnt have very little importance.

library(randomForest)

x_skill_reduced_train <- x_skill_reduced[1:20000,]
x_skill_reduced_test <- x_skill_reduced[20001:28000,]

rf.fit_2 <- randomForest(churn~.,data=x_skill_reduced_train, importance = TRUE, ntree = 1000)
rf.fit_2
## 
## Call:
##  randomForest(formula = churn ~ ., data = x_skill_reduced_train,      importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 19%
## Confusion matrix:
##      0     1 class.error
## 0 4269  2820  0.39779941
## 1  981 11930  0.07598172
predictions <- predict(rf.fit_2, x_skill_reduced_test)

accuracy_table <- confusionMatrix(predictions, x_skill_reduced_test$churn)
accuracy_table
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1613  403
##          1 1186 4798
##                                           
##                Accuracy : 0.8014          
##                  95% CI : (0.7925, 0.8101)
##     No Information Rate : 0.6501          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5332          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5763          
##             Specificity : 0.9225          
##          Pos Pred Value : 0.8001          
##          Neg Pred Value : 0.8018          
##              Prevalence : 0.3499          
##          Detection Rate : 0.2016          
##    Detection Prevalence : 0.2520          
##       Balanced Accuracy : 0.7494          
##                                           
##        'Positive' Class : 0               
## 
importance(rf.fit_2)
##                            0           1 MeanDecreaseAccuracy MeanDecreaseGini
## X_id              140.656586 226.8747294           234.938766      1247.821531
## country            73.056179 -29.1760630            31.796699       393.775162
## device_brand       11.365478   0.3168308             7.741574       187.721560
## device_model       40.554830   4.4925497            29.427952       593.996796
## session_cnt        65.131916  50.1698792            68.259587       478.913233
## session_length     25.023837  56.9788623            69.704869       366.869690
## lang               72.766187 -21.5350368            34.427194       410.006983
## max_lvl_no         49.240974  47.0332401            73.670509       516.816662
## gameplay_duration  25.553157  54.3552314            73.537160       366.000575
## bonus_cnt           9.291245  46.8951057            53.902685       243.397784
## hint1_cnt           7.390113  -8.8259221            -3.877599         7.780791
## hint2_cnt          13.769559  23.1082666            28.279376       141.455192
## hint3_cnt          88.916977  67.1273185           106.056852       645.282026
## repeat_cnt          9.204935  54.6857487            61.274740       276.610455
## gold_cnt           10.509182  42.4438193            43.328677       356.726603
## banner_cnt         15.804404  50.1201949            63.227288       322.635815
## is_cnt             28.205529  33.6783859            49.539047       236.534822
## rv_cnt             -3.611371  44.4985424            41.252114       187.754332
## multi_lang         26.236048   8.8517091            25.593072        20.524749
## max_ses_length     20.541980  57.7586417            62.593033       356.284638
## median_ses_length  44.384494  44.9523424            66.203980       515.638682
## avg_ses_length     36.370464  47.2171876            67.278968       452.295258
## skill              50.490246  62.5563717            93.163265       820.883955

When we fit the random forest with more trees, hint1_cnt shows a negative importance, which suggests that removing this column might improve accuracy. I got the same result when I fit with 2500 trees.
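One way to test this is to drop the suspect columns from the model formula before refitting. A sketch, assuming the training frame from above is still loaded (`rf.fit_3` is a hypothetical name, not a model from this report):

```r
# Build a formula that excludes hint1_cnt (and the ID column, which
# should not generalize to unseen users).
reduced_formula <- update(churn ~ ., . ~ . - hint1_cnt - X_id)
reduced_formula
## churn ~ . - hint1_cnt - X_id

# rf.fit_3 <- randomForest(reduced_formula, data = x_skill_reduced_train,
#                          importance = TRUE, ntree = 1000)
```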

Boosting

I tried to fit gbm with different distributions but failed. The multinomial distribution gave a warning saying it is currently broken, and it could not produce meaningful predictions. Bernoulli failed to fit; I also tried converting the churn column to numeric to fit bernoulli, but that failed as well.

This is why I will fit gradient boosting with h2o.

Install h2o and prepare the data for h2o:

library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:data.table':
## 
##     hour, month, week, year
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         53 minutes 57 seconds 
##     H2O cluster timezone:       Europe/Istanbul 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.38.0.1 
##     H2O cluster version age:    3 months and 13 days !!! 
##     H2O cluster name:           H2O_started_from_R_erkinyilmaz_yxq635 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.17 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.2.1 (2022-06-23)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (3 months and 13 days)!
## Please download and install the latest version from http://h2o.ai/download/
x_skill_reduced_h2o <- as.h2o(x_skill_reduced)
## Warning in use.package("data.table"): data.table cannot be used without R
## package bit64 version 0.9.7 or higher. Please upgrade to take advangage of
## data.table speedups.
x_skill_reduced_h2o_train <- x_skill_reduced_h2o[1:20000,]
x_skill_reduced_h2o_test <- x_skill_reduced_h2o[20001:28000,]

Fit model:

# Get the names of all of the columns in the data
column_names <- names(x_skill_reduced_h2o)

# Remove the response variable from the list of column names
predictor_variables <- column_names[column_names != "churn"]

gbm.fit <- h2o.gbm(x = predictor_variables,
                 y = "churn",
                 training_frame = x_skill_reduced_h2o_train,
                 ntrees = 100,
                 max_depth = 5,
                 distribution = "bernoulli")
gbm.fit
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  GBM_model_R_1672583265719_14612 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             100                      100               41385         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000         11         32    28.26000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.1112661
## RMSE:  0.3335657
## LogLoss:  0.3632016
## Mean Per-Class Error:  0.1856426
## AUC:  0.9115125
## AUCPR:  0.9428195
## Gini:  0.8230249
## R^2:  0.5137296
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0     1    Error         Rate
## 0      4933  2156 0.304133   =2156/7089
## 1       867 12044 0.067152   =867/12911
## Totals 5800 14200 0.151150  =3023/20000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold        value idx
## 1                       max f1  0.495741     0.888495 224
## 2                       max f2  0.231996     0.934676 309
## 3                 max f0point5  0.710989     0.884733 148
## 4                 max accuracy  0.519646     0.849200 217
## 5                max precision  0.982826     1.000000   0
## 6                   max recall  0.050175     1.000000 388
## 7              max specificity  0.982826     1.000000   0
## 8             max absolute_mcc  0.519646     0.663772 217
## 9   max min_per_class_accuracy  0.688043     0.829680 157
## 10 max mean_per_class_accuracy  0.670376     0.830743 165
## 11                     max tns  0.982826  7089.000000   0
## 12                     max fns  0.982826 12880.000000   0
## 13                     max fps  0.012078  7089.000000 399
## 14                     max tps  0.050175 12911.000000 388
## 15                     max tnr  0.982826     1.000000   0
## 16                     max fnr  0.982826     0.997599   0
## 17                     max fpr  0.012078     1.000000 399
## 18                     max tpr  0.050175     1.000000 388
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp_plot(gbm.fit)

The gbm model picks the “skill” feature that we created as the most important predictor.

Predictions and accuracy:

h2o.performance(gbm.fit, x_skill_reduced_h2o_test)
## H2OBinomialMetrics: gbm
## 
## MSE:  0.1437507
## RMSE:  0.3791447
## LogLoss:  0.4530854
## Mean Per-Class Error:  0.258661
## AUC:  0.8381856
## AUCPR:  0.8787315
## Gini:  0.6763713
## R^2:  0.3680245
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1520 1279 0.456949  =1279/2799
## 1       314 4887 0.060373   =314/5201
## Totals 1834 6166 0.199125  =1593/8000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.385004    0.859857 265
## 2                       max f2  0.202673    0.922341 323
## 3                 max f0point5  0.662970    0.842887 174
## 4                 max accuracy  0.474420    0.803875 237
## 5                max precision  0.988740    1.000000   0
## 6                   max recall  0.030222    1.000000 395
## 7              max specificity  0.988740    1.000000   0
## 8             max absolute_mcc  0.474420    0.554598 237
## 9   max min_per_class_accuracy  0.708842    0.765273 156
## 10 max mean_per_class_accuracy  0.662970    0.773858 174
## 11                     max tns  0.988740 2799.000000   0
## 12                     max fns  0.988740 5200.000000   0
## 13                     max fps  0.007193 2799.000000 399
## 14                     max tps  0.030222 5201.000000 395
## 15                     max tnr  0.988740    1.000000   0
## 16                     max fnr  0.988740    0.999808   0
## 17                     max fpr  0.007193    1.000000 399
## 18                     max tpr  0.030222    1.000000 395
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1593/8000) #our accuracy
## [1] 0.800875

We get 80% accuracy (calculated from the confusion matrix).

Now let’s try it with the data set whose categorical cardinality was not reduced:

x_skill_h2o <- as.h2o(x_skill)
## Warning in use.package("data.table"): data.table cannot be used without R
## package bit64 version 0.9.7 or higher. Please upgrade to take advangage of
## data.table speedups.
x_skill_h2o_train <- x_skill_h2o[1:20000,]
x_skill_h2o_test <- x_skill_h2o[20001:28000,]

Fit model:

# Get the names of all of the columns in the data
column_names <- names(x_skill_h2o)

# Remove the response variable from the list of column names
predictor_variables <- column_names[column_names != "churn"]

gbm.fit <- h2o.gbm(x = predictor_variables,
                 y = "churn",
                 training_frame = x_skill_h2o_train,
                 ntrees = 100,
                 learn_rate = 0.1,
                 max_depth = 5,
                 distribution = "bernoulli")
gbm.fit
## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  GBM_model_R_1672583265719_14701 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             100                      100               61467         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000         12         32    28.32000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.1047529
## RMSE:  0.3236555
## LogLoss:  0.3456253
## Mean Per-Class Error:  0.1766625
## AUC:  0.9206019
## AUCPR:  0.9477233
## Gini:  0.8412037
## R^2:  0.5421943
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0     1    Error         Rate
## 0      4984  2105 0.296939   =2105/7089
## 1       728 12183 0.056386   =728/12911
## Totals 5712 14288 0.141650  =2833/20000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold        value idx
## 1                       max f1  0.481337     0.895842 232
## 2                       max f2  0.286516     0.939697 292
## 3                 max f0point5  0.738911     0.890572 141
## 4                 max accuracy  0.511068     0.858650 223
## 5                max precision  0.985792     1.000000   0
## 6                   max recall  0.053208     1.000000 384
## 7              max specificity  0.985792     1.000000   0
## 8             max absolute_mcc  0.490417     0.685377 229
## 9   max min_per_class_accuracy  0.696010     0.836341 158
## 10 max mean_per_class_accuracy  0.650240     0.840662 175
## 11                     max tns  0.985792  7089.000000   0
## 12                     max fns  0.985792 12894.000000   0
## 13                     max fps  0.013926  7089.000000 399
## 14                     max tps  0.053208 12911.000000 384
## 15                     max tnr  0.985792     1.000000   0
## 16                     max fnr  0.985792     0.998683   0
## 17                     max fpr  0.013926     1.000000 399
## 18                     max tpr  0.053208     1.000000 384
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp_plot(gbm.fit)

Accuracy:

h2o.performance(gbm.fit, x_skill_h2o_test)
## H2OBinomialMetrics: gbm
## 
## MSE:  0.1418578
## RMSE:  0.3766401
## LogLoss:  0.4482721
## Mean Per-Class Error:  0.2511426
## AUC:  0.8405864
## AUCPR:  0.8780414
## Gini:  0.6811728
## R^2:  0.3763465
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1582 1217 0.434798  =1217/2799
## 1       351 4850 0.067487   =351/5201
## Totals 1933 6067 0.196000  =1568/8000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.418627    0.860845 257
## 2                       max f2  0.170998    0.923776 333
## 3                 max f0point5  0.657212    0.844406 174
## 4                 max accuracy  0.515190    0.807375 226
## 5                max precision  0.966656    0.940678   6
## 6                   max recall  0.031235    1.000000 393
## 7              max specificity  0.982469    0.999285   0
## 8             max absolute_mcc  0.543224    0.564413 217
## 9   max min_per_class_accuracy  0.713304    0.765631 152
## 10 max mean_per_class_accuracy  0.624384    0.776350 187
## 11                     max tns  0.982469 2797.000000   0
## 12                     max fns  0.982469 5196.000000   0
## 13                     max fps  0.006840 2799.000000 399
## 14                     max tps  0.031235 5201.000000 393
## 15                     max tnr  0.982469    0.999285   0
## 16                     max fnr  0.982469    0.999039   0
## 17                     max fpr  0.006840    1.000000 399
## 18                     max tpr  0.031235    1.000000 393
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Our accuracy increased by a very small amount.
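Computing accuracy from the h2o confusion matrix the same way as before (1568 misclassified out of 8000 in the totals row above):

```r
1 - (1568/8000) # our accuracy
## [1] 0.804
```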

XGBoost

xgb.fit <- h2o.xgboost(x = predictor_variables, y = "churn", training_frame = x_skill_h2o_train)
xgb.fit
## Model Details:
## ==============
## 
## H2OBinomialModel: xgboost
## Model ID:  XGBoost_model_R_1672583265719_14804 
## Model Summary: 
##   number_of_trees
## 1              50
## 
## 
## H2OBinomialMetrics: xgboost
## ** Reported on training data. **
## 
## MSE:  0.1080628
## RMSE:  0.3287291
## LogLoss:  0.3516744
## Mean Per-Class Error:  0.1852898
## AUC:  0.9183564
## AUCPR:  0.9494955
## Gini:  0.8367128
## R^2:  0.5277288
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0     1    Error         Rate
## 0      4910  2179 0.307378   =2179/7089
## 1       816 12095 0.063202   =816/12911
## Totals 5726 14274 0.149750  =2995/20000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold        value idx
## 1                       max f1  0.492955     0.889829 232
## 2                       max f2  0.278149     0.935720 299
## 3                 max f0point5  0.694774     0.890677 160
## 4                 max accuracy  0.548336     0.851550 214
## 5                max precision  0.987350     1.000000   0
## 6                   max recall  0.039872     1.000000 388
## 7              max specificity  0.987350     1.000000   0
## 8             max absolute_mcc  0.555534     0.670119 212
## 9   max min_per_class_accuracy  0.680782     0.836648 167
## 10 max mean_per_class_accuracy  0.670212     0.837841 171
## 11                     max tns  0.987350  7089.000000   0
## 12                     max fns  0.987350 12893.000000   0
## 13                     max fps  0.007559  7089.000000 399
## 14                     max tps  0.039872 12911.000000 388
## 15                     max tnr  0.987350     1.000000   0
## 16                     max fnr  0.987350     0.998606   0
## 17                     max fpr  0.007559     1.000000 399
## 18                     max tpr  0.039872     1.000000 388
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(xgb.fit, x_skill_h2o_test) 
## H2OBinomialMetrics: xgboost
## 
## MSE:  0.1412545
## RMSE:  0.3758384
## LogLoss:  0.4453738
## Mean Per-Class Error:  0.2459472
## AUC:  0.8455429
## AUCPR:  0.8866394
## Gini:  0.6910858
## R^2:  0.3789987
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1624 1175 0.419793  =1175/2799
## 1       375 4826 0.072102   =375/5201
## Totals 1999 6001 0.193750  =1550/8000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.431566    0.861632 256
## 2                       max f2  0.202383    0.922765 325
## 3                 max f0point5  0.686181    0.843017 164
## 4                 max accuracy  0.524701    0.807500 224
## 5                max precision  0.987680    1.000000   0
## 6                   max recall  0.029208    1.000000 393
## 7              max specificity  0.987680    1.000000   0
## 8             max absolute_mcc  0.524701    0.565122 224
## 9   max min_per_class_accuracy  0.708116    0.767352 155
## 10 max mean_per_class_accuracy  0.606980    0.775497 193
## 11                     max tns  0.987680 2799.000000   0
## 12                     max fns  0.987680 5194.000000   0
## 13                     max fps  0.006029 2799.000000 399
## 14                     max tps  0.029208 5201.000000 393
## 15                     max tnr  0.987680    1.000000   0
## 16                     max fnr  0.987680    0.998654   0
## 17                     max fpr  0.006029    1.000000 399
## 18                     max tpr  0.029208    1.000000 393
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1550/8000)
## [1] 0.80625

We get slightly better accuracy (still below 81%). I will try with the reduced data.

xgb.fit <- h2o.xgboost(x = predictor_variables, y = "churn", training_frame = x_skill_reduced_h2o_train)
xgb.fit
## Model Details:
## ==============
## 
## H2OBinomialModel: xgboost
## Model ID:  XGBoost_model_R_1672583265719_14852 
## Model Summary: 
##   number_of_trees
## 1              50
## 
## 
## H2OBinomialMetrics: xgboost
## ** Reported on training data. **
## 
## MSE:  0.09636587
## RMSE:  0.3104285
## LogLoss:  0.3194093
## Mean Per-Class Error:  0.1515705
## AUC:  0.9377671
## AUCPR:  0.9628305
## Gini:  0.8755342
## R^2:  0.5788485
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0     1    Error         Rate
## 0      5460  1629 0.229793   =1629/7089
## 1       947 11964 0.073348   =947/12911
## Totals 6407 13593 0.128800  =2576/20000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold        value idx
## 1                       max f1  0.551470     0.902807 213
## 2                       max f2  0.283838     0.940270 296
## 3                 max f0point5  0.710059     0.907574 154
## 4                 max accuracy  0.578007     0.872050 204
## 5                max precision  0.992283     1.000000   0
## 6                   max recall  0.035310     1.000000 387
## 7              max specificity  0.992283     1.000000   0
## 8             max absolute_mcc  0.589274     0.718040 200
## 9   max min_per_class_accuracy  0.674724     0.859887 168
## 10 max mean_per_class_accuracy  0.634102     0.861320 185
## 11                     max tns  0.992283  7089.000000   0
## 12                     max fns  0.992283 12899.000000   0
## 13                     max fps  0.006358  7089.000000 399
## 14                     max tps  0.035310 12911.000000 387
## 15                     max tnr  0.992283     1.000000   0
## 16                     max fnr  0.992283     0.999071   0
## 17                     max fpr  0.006358     1.000000 399
## 18                     max tpr  0.035310     1.000000 387
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(xgb.fit, x_skill_reduced_h2o_test) 
## H2OBinomialMetrics: xgboost
## 
## MSE:  0.1438475
## RMSE:  0.3792723
## LogLoss:  0.453158
## Mean Per-Class Error:  0.2595262
## AUC:  0.8400971
## AUCPR:  0.8815367
## Gini:  0.6801941
## R^2:  0.3675991
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1520 1279 0.456949  =1279/2799
## 1       323 4878 0.062103   =323/5201
## Totals 1843 6157 0.200250  =1602/8000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.374642    0.858954 269
## 2                       max f2  0.125937    0.923370 346
## 3                 max f0point5  0.647444    0.842209 183
## 4                 max accuracy  0.453353    0.802750 245
## 5                max precision  0.990723    1.000000   0
## 6                   max recall  0.018043    1.000000 395
## 7              max specificity  0.990723    1.000000   0
## 8             max absolute_mcc  0.466034    0.551566 241
## 9   max min_per_class_accuracy  0.717769    0.763487 155
## 10 max mean_per_class_accuracy  0.647444    0.773594 183
## 11                     max tns  0.990723 2799.000000   0
## 12                     max fns  0.990723 5190.000000   0
## 13                     max fps  0.002912 2799.000000 399
## 14                     max tps  0.018043 5201.000000 395
## 15                     max tnr  0.990723    1.000000   0
## 16                     max fnr  0.990723    0.997885   0
## 17                     max fpr  0.002912    1.000000 399
## 18                     max tpr  0.018043    1.000000 395
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

We get slightly worse accuracy on the reduced data set (1 - 1602/8000 ≈ 79.97%).
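Instead of hand-computing accuracy from the confusion-matrix totals (e.g. `1-(1550/8000)`), a small helper can read it off the performance object directly. A minimal sketch, assuming a running H2O cluster; the helper name `test_accuracy` is my own:

```r
# Sketch: overall accuracy at the F1-optimal threshold.
# Assumes a live H2O cluster; `fit` is an H2O model, `test` an H2OFrame.
test_accuracy <- function(fit, test) {
  perf <- h2o.performance(fit, newdata = test)
  cm   <- h2o.confusionMatrix(perf)   # rows: 0, 1, Totals
  1 - tail(cm$Error, 1)               # Error in the Totals row = overall error rate
}
```

For example, `test_accuracy(xgb.fit, x_skill_reduced_h2o_test)` should reproduce the 1 - 1602/8000 figure above.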

AutoML

Finally, I will try AutoML.

automl.fit <- h2o.automl(
  x = predictor_variables, 
  y = "churn", 
  training_frame = x_skill_h2o_train, 
  max_models = 10, 
  seed = 1)
aml <- automl.fit
lb <- aml@leaderboard
print(lb, n = nrow(lb))
##                                                   model_id       auc   logloss
## 1     StackedEnsemble_AllModels_1_AutoML_9_20230101_182207 0.8575123 0.4301484
## 2  StackedEnsemble_BestOfFamily_1_AutoML_9_20230101_182207 0.8563241 0.4322788
## 3                       XGBoost_3_AutoML_9_20230101_182207 0.8522842 0.4375503
## 4                           GBM_2_AutoML_9_20230101_182207 0.8502435 0.4407005
## 5                           GBM_1_AutoML_9_20230101_182207 0.8486417 0.4419708
## 6                           GBM_3_AutoML_9_20230101_182207 0.8466284 0.4441636
## 7                           GBM_4_AutoML_9_20230101_182207 0.8424572 0.4522268
## 8                       XGBoost_2_AutoML_9_20230101_182207 0.8408336 0.4619994
## 9                       XGBoost_1_AutoML_9_20230101_182207 0.8349079 0.4713581
## 10                          XRT_1_AutoML_9_20230101_182207 0.8310142 0.4821717
## 11                          DRF_1_AutoML_9_20230101_182207 0.8171732 0.4953259
## 12                          GLM_1_AutoML_9_20230101_182207 0.7730249 0.5436575
##        aucpr mean_per_class_error      rmse       mse
## 1  0.8958364            0.2449691 0.3686387 0.1358945
## 2  0.8944346            0.2410902 0.3696474 0.1366392
## 3  0.8907117            0.2401100 0.3720752 0.1384400
## 4  0.8893884            0.2345314 0.3735160 0.1395142
## 5  0.8879666            0.2392166 0.3740593 0.1399203
## 6  0.8855012            0.2459748 0.3748629 0.1405222
## 7  0.8832669            0.2406388 0.3788148 0.1435006
## 8  0.8811776            0.2451194 0.3813694 0.1454426
## 9  0.8766028            0.2617947 0.3853821 0.1485194
## 10 0.8763051            0.2617249 0.3937005 0.1550000
## 11 0.8697120            0.2750101 0.4009410 0.1607536
## 12 0.8392896            0.3535085 0.4242718 0.1800065
## 
## [12 rows x 7 columns]
h2o.performance(aml@leader, x_skill_h2o_test)
## H2OBinomialMetrics: stackedensemble
## 
## MSE:  0.1394035
## RMSE:  0.3733678
## LogLoss:  0.4401468
## Mean Per-Class Error:  0.2513921
## AUC:  0.8472375
## AUCPR:  0.8869587
## Gini:  0.6944751
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1558 1241 0.443373  =1241/2799
## 1       309 4892 0.059412   =309/5201
## Totals 1867 6133 0.193750  =1550/8000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.393993    0.863243 267
## 2                       max f2  0.179430    0.924760 332
## 3                 max f0point5  0.655320    0.847608 177
## 4                 max accuracy  0.509947    0.809750 229
## 5                max precision  0.965852    0.969697   4
## 6                   max recall  0.045866    1.000000 390
## 7              max specificity  0.977023    0.999643   0
## 8             max absolute_mcc  0.509947    0.569162 229
## 9   max min_per_class_accuracy  0.710279    0.769852 155
## 10 max mean_per_class_accuracy  0.649793    0.780958 179
## 11                     max tns  0.977023 2798.000000   0
## 12                     max fns  0.977023 5196.000000   0
## 13                     max fps  0.012676 2799.000000 399
## 14                     max tps  0.045866 5201.000000 390
## 15                     max tnr  0.977023    0.999643   0
## 16                     max fnr  0.977023    0.999039   0
## 17                     max fpr  0.012676    1.000000 399
## 18                     max tpr  0.045866    1.000000 390
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1550/8000)
## [1] 0.80625

We get 80.6% accuracy.

Next, I will run the same AutoML setup on the reduced data set.

automl.fit <- h2o.automl(
  x = predictor_variables, 
  y = "churn", 
  training_frame = x_skill_reduced_h2o_train, 
  max_models = 10, 
  seed = 1)
aml <- automl.fit
lb <- aml@leaderboard
print(lb, n = nrow(lb))
##                                                    model_id       auc   logloss
## 1     StackedEnsemble_AllModels_1_AutoML_10_20230101_182434 0.8543237 0.4337217
## 2  StackedEnsemble_BestOfFamily_1_AutoML_10_20230101_182434 0.8534913 0.4351727
## 3                       XGBoost_3_AutoML_10_20230101_182434 0.8499165 0.4399922
## 4                           GBM_2_AutoML_10_20230101_182434 0.8477287 0.4434295
## 5                           GBM_1_AutoML_10_20230101_182434 0.8462881 0.4448453
## 6                           GBM_3_AutoML_10_20230101_182434 0.8440609 0.4476523
## 7                           GBM_4_AutoML_10_20230101_182434 0.8408652 0.4537178
## 8                       XGBoost_2_AutoML_10_20230101_182434 0.8369381 0.4676053
## 9                       XGBoost_1_AutoML_10_20230101_182434 0.8345407 0.4720734
## 10                          XRT_1_AutoML_10_20230101_182434 0.8298986 0.4834764
## 11                          DRF_1_AutoML_10_20230101_182434 0.8188427 0.4936558
## 12                          GLM_1_AutoML_10_20230101_182434 0.7695699 0.5463848
##        aucpr mean_per_class_error      rmse       mse
## 1  0.8932174            0.2427178 0.3701304 0.1369965
## 2  0.8922663            0.2483295 0.3707452 0.1374520
## 3  0.8889088            0.2468598 0.3727142 0.1389159
## 4  0.8876163            0.2490767 0.3748870 0.1405403
## 5  0.8868204            0.2437526 0.3756440 0.1411084
## 6  0.8837560            0.2619959 0.3767716 0.1419568
## 7  0.8807047            0.2553473 0.3795248 0.1440390
## 8  0.8780063            0.2678502 0.3839408 0.1474106
## 9  0.8765047            0.2556476 0.3856724 0.1487432
## 10 0.8761837            0.2670490 0.3946381 0.1557392
## 11 0.8713483            0.2873012 0.4003316 0.1602654
## 12 0.8369895            0.3532583 0.4250694 0.1806840
## 
## [12 rows x 7 columns]
h2o.performance(aml@leader, x_skill_reduced_h2o_test)
## H2OBinomialMetrics: stackedensemble
## 
## MSE:  0.1412805
## RMSE:  0.375873
## LogLoss:  0.4458455
## Mean Per-Class Error:  0.2447376
## AUC:  0.8428508
## AUCPR:  0.8818931
## Gini:  0.6857017
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1634 1165 0.416220  =1165/2799
## 1       381 4820 0.073255   =381/5201
## Totals 2015 5985 0.193250  =1546/8000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.443758    0.861792 249
## 2                       max f2  0.144687    0.923757 344
## 3                 max f0point5  0.647552    0.843305 179
## 4                 max accuracy  0.457275    0.807500 244
## 5                max precision  0.978902    1.000000   0
## 6                   max recall  0.038369    1.000000 394
## 7              max specificity  0.978902    1.000000   0
## 8             max absolute_mcc  0.457275    0.562705 244
## 9   max min_per_class_accuracy  0.707354    0.765988 154
## 10 max mean_per_class_accuracy  0.639264    0.775529 182
## 11                     max tns  0.978902 2799.000000   0
## 12                     max fns  0.978902 5198.000000   0
## 13                     max fps  0.018707 2799.000000 399
## 14                     max tps  0.038369 5201.000000 394
## 15                     max tnr  0.978902    1.000000   0
## 16                     max fnr  0.978902    0.999423   0
## 17                     max fpr  0.018707    1.000000 399
## 18                     max tpr  0.038369    1.000000 394
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1546/8000)
## [1] 0.80675

We get 80.7% accuracy, essentially the same as on the full data set.

Next, I will try without the skill feature:

x_h2o <- as.h2o(x)
## Warning in use.package("data.table"): data.table cannot be used without R
## package bit64 version 0.9.7 or higher. Please upgrade to take advangage of
## data.table speedups.
str(x_h2o)
## Class 'H2OFrame' <environment: 0x7fe1fd913c90> 
##  - attr(*, "op")= chr "Parse"
##  - attr(*, "id")= chr "x_sid_af00_243"
##  - attr(*, "eval")= logi FALSE
##  - attr(*, "nrow")= int 30000
##  - attr(*, "ncol")= int 23
##  - attr(*, "types")=List of 23
##   ..$ : chr "int"
##   ..$ : chr "enum"
##   ..$ : chr "enum"
##   ..$ : chr "enum"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "enum"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "enum"
##   ..$ : chr "enum"
##   ..$ : chr "int"
##   ..$ : chr "int"
##   ..$ : chr "int"
##  - attr(*, "data")='data.frame': 10 obs. of  23 variables:
##   ..$ X_id             : num  88464659 88466479 88468912 88471351 88473161 ...
##   ..$ country          : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51
##   ..$ device_brand     : Factor w/ 113 levels "","A-gold","A1",..: 108 90 45 45 45 90 90 108 108 108
##   ..$ device_model     : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 960 875 168 1256 521 610 887 982 975 1005
##   ..$ session_cnt      : num  7 1 1 2 1 3 1 3 1 8
##   ..$ session_length   : num  13162 1809 2141 1298 666 ...
##   ..$ lang             : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18
##   ..$ max_lvl_no       : num  13 9 28 14 12 20 9 23 9 1
##   ..$ gameplay_duration: num  1088 1898 1685 1221 532 ...
##   ..$ bonus_cnt        : num  6 9 8 2 4 23 3 3 0 15
##   ..$ hint1_cnt        : num  0 0 0 0 0 0 0 0 0 0
##   ..$ hint2_cnt        : num  2 1 0 0 1 0 0 0 0 0
##   ..$ hint3_cnt        : num  0 0 0 0 0 0 0 0 0 2
##   ..$ repeat_cnt       : num  7 10 21 7 10 30 4 1 1 38
##   ..$ gold_cnt         : num  418 25 990 1360 1367 ...
##   ..$ banner_cnt       : num  17 38 0 49 23 154 16 33 9 14
##   ..$ is_cnt           : num  1 0 16 2 0 7 0 11 0 2
##   ..$ rv_cnt           : num  0 1 1 1 1 3 0 1 1 4
##   ..$ churn            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2
##   ..$ multi_lang       : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1
##   ..$ max_ses_length   : num  12451 1809 2141 1248 666 ...
##   ..$ median_ses_length: num  100 1809 2141 649 666 ...
##   ..$ avg_ses_length   : num  1880 1809 2141 649 666 ...
x_h2o_train <- x_h2o[1:20000, ]
x_h2o_test <- x_h2o[20001:28000, ]

# Get the names of all of the columns in the data
column_names <- names(x_h2o)

# Remove the response variable from the list of column names
predictor_variables <- column_names[column_names != "churn"]
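The two-step column selection above can equivalently be written in one line with base R's `setdiff()`:

```r
# Equivalent: every column of the frame except the response
predictor_variables <- setdiff(names(x_h2o), "churn")
```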

automl.fit <- h2o.automl(
  x = predictor_variables, 
  y = "churn", 
  training_frame = x_h2o_train, 
  max_models = 10, 
  seed = 1)
aml <- automl.fit
lb <- aml@leaderboard
print(lb, n = nrow(lb))
##                                                    model_id       auc   logloss
## 1     StackedEnsemble_AllModels_1_AutoML_11_20230101_182703 0.8629645 0.4307408
## 2  StackedEnsemble_BestOfFamily_1_AutoML_11_20230101_182703 0.8618528 0.4326387
## 3                           GBM_1_AutoML_11_20230101_182703 0.8568192 0.4397327
## 4                       XGBoost_3_AutoML_11_20230101_182703 0.8567609 0.4407010
## 5                           GBM_2_AutoML_11_20230101_182703 0.8563977 0.4399842
## 6                           GBM_3_AutoML_11_20230101_182703 0.8543046 0.4431551
## 7                           GBM_4_AutoML_11_20230101_182703 0.8505075 0.4487417
## 8                       XGBoost_2_AutoML_11_20230101_182703 0.8448344 0.4654049
## 9                       XGBoost_1_AutoML_11_20230101_182703 0.8430556 0.4691345
## 10                          XRT_1_AutoML_11_20230101_182703 0.8396969 0.4845954
## 11                          DRF_1_AutoML_11_20230101_182703 0.8288493 0.4934903
## 12                          GLM_1_AutoML_11_20230101_182703 0.7678991 0.5685092
##        aucpr mean_per_class_error      rmse       mse
## 1  0.8932263            0.2312695 0.3691162 0.1362468
## 2  0.8924886            0.2313376 0.3701826 0.1370351
## 3  0.8871571            0.2417333 0.3733418 0.1393841
## 4  0.8870988            0.2327261 0.3737981 0.1397250
## 5  0.8873870            0.2327423 0.3734084 0.1394339
## 6  0.8835169            0.2333443 0.3747803 0.1404603
## 7  0.8818336            0.2455255 0.3776878 0.1426481
## 8  0.8771850            0.2380139 0.3833282 0.1469405
## 9  0.8737188            0.2385371 0.3842626 0.1476577
## 10 0.8752382            0.2507112 0.3950898 0.1560960
## 11 0.8717912            0.2722333 0.4005574 0.1604462
## 12 0.8236295            0.3429454 0.4321508 0.1867543
## 
## [12 rows x 7 columns]
h2o.performance(aml@leader, x_h2o_test)
## H2OBinomialMetrics: stackedensemble
## 
## MSE:  0.1391196
## RMSE:  0.3729874
## LogLoss:  0.4390133
## Mean Per-Class Error:  0.2362447
## AUC:  0.8523114
## AUCPR:  0.8862735
## Gini:  0.7046229
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error        Rate
## 0      1737 1151 0.398546  =1151/2888
## 1       378 4734 0.073944   =378/5112
## Totals 2115 5885 0.191125  =1529/8000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.445769    0.860962 244
## 2                       max f2  0.228322    0.921725 311
## 3                 max f0point5  0.642056    0.843388 176
## 4                 max accuracy  0.519997    0.809250 221
## 5                max precision  0.982340    1.000000   0
## 6                   max recall  0.040029    1.000000 390
## 7              max specificity  0.982340    1.000000   0
## 8             max absolute_mcc  0.519997    0.575918 221
## 9   max min_per_class_accuracy  0.698417    0.775277 154
## 10 max mean_per_class_accuracy  0.632823    0.782535 179
## 11                     max tns  0.982340 2888.000000   0
## 12                     max fns  0.982340 5111.000000   0
## 13                     max fps  0.014151 2888.000000 399
## 14                     max tps  0.040029 5112.000000 390
## 15                     max tnr  0.982340    1.000000   0
## 16                     max fnr  0.982340    0.999804   0
## 17                     max fpr  0.014151    1.000000 399
## 18                     max tpr  0.040029    1.000000 390
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1529/8000)
## [1] 0.808875

Our accuracy is almost the same as, and slightly higher than, the version with skill.
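The accuracy above can also be recovered, together with precision and recall, directly from the F1-optimal confusion matrix (cell counts copied from the output above):

```r
# Cells of the F1-optimal confusion matrix printed above
tn <- 1737; fp <- 1151; fn <- 378; tp <- 4734

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # 0.808875, same as 1-(1529/8000)
precision <- tp / (tp + fp)                   # ~0.8044: share of predicted churners who churned
recall    <- tp / (tp + fn)                   # ~0.9261: share of actual churners we caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```

The high recall and lower precision match the confusion matrix: the model rarely misses a churner but flags a fair number of non-churners as churn.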

I also tried increasing max_models, but the accuracy did not improve (possibly due to overfitting from the larger model pool).
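If I wanted to push AutoML further, a run could instead be capped by a time budget rather than a model count. This is only a sketch, reusing the `predictor_variables` and training-frame names from above; the `max_runtime_secs` and stopping settings are illustrative, not tuned values:

```r
# Sketch: time-budgeted AutoML run instead of a fixed max_models.
# Assumes an h2o cluster is running and the frames from earlier chunks exist.
automl.fit2 <- h2o.automl(
  x = predictor_variables,
  y = "churn",
  training_frame = x_h2o_train,
  max_runtime_secs = 1800,     # 30-minute budget (arbitrary example)
  stopping_metric = "AUC",     # early-stop individual models on CV AUC
  seed = 1)
```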

Extra: Logistic regression fitted to reduced dataset with skill:

lg <- glm(churn~., data = x_skill_reduced_train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_skill_reduced_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7079  -0.9561   0.5628   0.8214   5.1098  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   2.707e+01  1.023e+00  26.461  < 2e-16 ***
## X_id                         -2.318e-07  8.386e-09 -27.639  < 2e-16 ***
## countryBR                    -1.983e+00  5.931e-01  -3.343 0.000828 ***
## countryCZ                     6.164e-01  6.416e-01   0.961 0.336738    
## countryFR                    -2.568e-01  4.686e-01  -0.548 0.583714    
## countryGR                    -1.146e+00  5.650e-01  -2.028 0.042604 *  
## countryHU                    -3.638e-02  4.419e-01  -0.082 0.934390    
## countryIT                    -8.162e-01  4.668e-01  -1.749 0.080367 .  
## countryLT                    -5.211e-01  5.313e-01  -0.981 0.326698    
## countryMX                     8.100e-02  3.738e-01   0.217 0.828457    
## countryNL                    -8.973e-01  3.922e-01  -2.288 0.022158 *  
## countryOther                 -6.750e-01  3.366e-01  -2.005 0.044918 *  
## countryPL                    -2.976e-01  4.765e-01  -0.624 0.532363    
## countryRO                    -2.857e-01  3.901e-01  -0.732 0.463945    
## countrySE                    -2.123e-01  5.180e-01  -0.410 0.681833    
## countrySK                    -9.686e-02  4.794e-01  -0.202 0.839902    
## countryTR                    -5.387e-01  5.107e-01  -1.055 0.291424    
## device_brandmotorola         -2.248e-01  1.040e-01  -2.161 0.030723 *  
## device_brandOPPO             -7.638e-01  1.076e-01  -7.101 1.24e-12 ***
## device_brandOther            -3.625e-01  7.996e-02  -4.533 5.81e-06 ***
## device_brandsamsung          -3.732e-01  6.229e-02  -5.991 2.08e-09 ***
## device_brandXiaomi           -4.684e-01  7.008e-02  -6.683 2.34e-11 ***
## device_modelANE-LX1          -1.816e+00  3.270e-01  -5.556 2.77e-08 ***
## device_modelM2003J15SC       -1.529e+00  3.159e-01  -4.840 1.30e-06 ***
## device_modelM2004J19C        -1.504e+00  3.276e-01  -4.591 4.42e-06 ***
## device_modelMAR-LX1A         -2.078e+00  3.095e-01  -6.714 1.89e-11 ***
## device_modelOther            -1.857e+00  2.872e-01  -6.467 1.00e-10 ***
## device_modelPOT-LX1          -1.948e+00  3.252e-01  -5.991 2.08e-09 ***
## device_modelRedmi Note 7     -1.459e+00  3.465e-01  -4.212 2.53e-05 ***
## device_modelRedmi Note 8     -1.358e+00  3.536e-01  -3.841 0.000122 ***
## device_modelRedmi Note 8 Pro -1.575e+00  3.208e-01  -4.909 9.13e-07 ***
## device_modelRedmi Note 8T    -1.233e+00  3.406e-01  -3.619 0.000296 ***
## device_modelRedmi Note 9 Pro -1.435e+00  3.168e-01  -4.529 5.93e-06 ***
## device_modelSM-A105FN        -1.415e+00  3.314e-01  -4.271 1.95e-05 ***
## device_modelSM-A125F         -2.188e+00  3.255e-01  -6.723 1.78e-11 ***
## device_modelSM-A127F         -3.390e+00  3.499e-01  -9.689  < 2e-16 ***
## device_modelSM-A202F         -1.536e+00  3.171e-01  -4.845 1.27e-06 ***
## device_modelSM-A217F         -1.603e+00  3.147e-01  -5.093 3.52e-07 ***
## device_modelSM-A405FN        -1.560e+00  3.405e-01  -4.580 4.64e-06 ***
## device_modelSM-A505FN        -1.538e+00  3.191e-01  -4.818 1.45e-06 ***
## device_modelSM-A515F         -1.593e+00  3.056e-01  -5.213 1.85e-07 ***
## device_modelSM-A528B         -3.253e+00  3.506e-01  -9.278  < 2e-16 ***
## device_modelSM-A705FN        -1.463e+00  3.404e-01  -4.298 1.72e-05 ***
## device_modelSM-A715F         -1.765e+00  3.128e-01  -5.643 1.67e-08 ***
## device_modelSM-G950F         -1.474e+00  3.396e-01  -4.339 1.43e-05 ***
## device_modelSM-G960F         -1.396e+00  3.394e-01  -4.114 3.89e-05 ***
## device_modelSM-G965F         -1.265e+00  3.519e-01  -3.595 0.000324 ***
## device_modelSM-G973F         -1.480e+00  3.275e-01  -4.518 6.23e-06 ***
## device_modelSM-G975F         -1.463e+00  3.523e-01  -4.152 3.30e-05 ***
## device_modelSM-G991B         -2.446e+00  3.303e-01  -7.406 1.30e-13 ***
## device_modelSNE-LX1          -2.226e+00  3.371e-01  -6.603 4.03e-11 ***
## device_modelVOG-L29          -2.076e+00  3.265e-01  -6.360 2.02e-10 ***
## session_cnt                   5.607e-03  2.605e-03   2.152 0.031397 *  
## session_length               -2.117e-06  9.894e-07  -2.140 0.032376 *  
## langBU                       -2.510e+00  5.945e-01  -4.221 2.43e-05 ***
## langCS                       -3.170e+00  7.258e-01  -4.367 1.26e-05 ***
## langES                       -2.282e+00  5.065e-01  -4.506 6.62e-06 ***
## langFR                       -2.469e+00  5.985e-01  -4.125 3.71e-05 ***
## langGR                       -1.079e+00  6.712e-01  -1.607 0.108054    
## langHU                       -2.424e+00  5.806e-01  -4.175 2.98e-05 ***
## langIT                       -1.417e+00  5.871e-01  -2.413 0.015828 *  
## langLT                       -2.034e+00  6.382e-01  -3.187 0.001439 ** 
## langNL                       -1.715e+00  5.482e-01  -3.129 0.001754 ** 
## langOther                    -2.171e+00  4.952e-01  -4.384 1.17e-05 ***
## langPL                       -2.222e+00  6.039e-01  -3.680 0.000233 ***
## langRO                       -1.745e+00  5.335e-01  -3.271 0.001073 ** 
## langRU                       -2.040e+00  5.235e-01  -3.897 9.75e-05 ***
## langSE                       -2.090e+00  6.200e-01  -3.371 0.000750 ***
## langSK                       -2.260e+00  5.997e-01  -3.769 0.000164 ***
## langTR                       -1.726e+00  6.487e-01  -2.662 0.007776 ** 
## max_lvl_no                   -1.551e-02  1.839e-03  -8.438  < 2e-16 ***
## gameplay_duration            -2.592e-06  1.335e-06  -1.942 0.052088 .  
## bonus_cnt                    -5.744e-04  1.348e-04  -4.262 2.02e-05 ***
## hint1_cnt                    -1.998e-03  1.576e-03  -1.268 0.204871    
## hint2_cnt                     3.471e-02  5.291e-03   6.560 5.38e-11 ***
## hint3_cnt                     3.809e-04  5.006e-04   0.761 0.446657    
## repeat_cnt                    1.887e-05  1.753e-04   0.108 0.914287    
## gold_cnt                      1.954e-04  3.964e-05   4.928 8.30e-07 ***
## banner_cnt                   -2.127e-04  9.657e-05  -2.202 0.027644 *  
## is_cnt                        5.023e-03  8.359e-04   6.009 1.87e-09 ***
## rv_cnt                       -5.664e-04  4.715e-04  -1.201 0.229651    
## multi_langTRUE                1.244e+00  1.613e-01   7.715 1.21e-14 ***
## max_ses_length                7.629e-06  4.853e-06   1.572 0.115916    
## median_ses_length            -3.627e-04  3.349e-05 -10.830  < 2e-16 ***
## avg_ses_length                1.502e-05  2.867e-05   0.524 0.600337    
## skill                        -1.120e+00  5.294e-02 -21.158  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26006  on 19999  degrees of freedom
## Residual deviance: 21602  on 19914  degrees of freedom
## AIC: 21774
## 
## Number of Fisher Scoring iterations: 6
lg.probs <- predict(lg, x_skill_reduced_test, type="response")
lg.pred = rep("0", 8000)
lg.pred[lg.probs >.55]="1"

act_churn <- x_skill_reduced_test$churn

table(lg.pred,act_churn)
##        act_churn
## lg.pred    0    1
##       0 1481  866
##       1 1318 4335
mean(lg.pred==act_churn)
## [1] 0.727

We got 72.7% accuracy, which is considerably worse than the other models.
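Before discarding logistic regression entirely, it is worth checking whether the 0.55 cutoff was simply a poor choice. The exploratory sketch below reuses `lg.probs` and `act_churn` from the chunk above; the thresholds scanned are arbitrary:

```r
# Sketch: scan a few cutoffs to see how test accuracy moves with the threshold.
for (th in c(0.40, 0.50, 0.55, 0.60, 0.70)) {
  pred <- ifelse(lg.probs > th, "1", "0")
  cat(sprintf("threshold %.2f -> accuracy %.4f\n", th, mean(pred == act_churn)))
}
```

Even with a better cutoff, a gap of roughly eight accuracy points to the ensemble models would be hard to close with threshold tuning alone.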

Final Model

Out of all the models, AutoML did best. The remaining question is which data set to use (reduced, skill-added, or the original format). For two reasons, I will fit AutoML on the original data, without reducing the categories or adding skill:

  1. I got the best accuracy on that data set.
  2. I would have to add skill to the test set to use it in my predictions. This is problematic because on the training set I removed ~2k outliers from the skill feature, which is what made it such a good feature, and I don’t have that option for the “actual” test set.

So I will train the final model on the whole data set. (Note that any performance measured afterwards on x_h2o_test is optimistic, since those rows are now part of the training data.)

automl.fit <- h2o.automl(
  x = predictor_variables, 
  y = "churn", 
  training_frame = x_h2o,
  max_models = 10, 
  seed = 1)
aml <- automl.fit
lb <- aml@leaderboard
print(lb, n = nrow(lb))
##                                                    model_id       auc   logloss
## 1     StackedEnsemble_AllModels_1_AutoML_12_20230101_182943 0.8625573 0.4300385
## 2  StackedEnsemble_BestOfFamily_1_AutoML_12_20230101_182943 0.8607965 0.4326567
## 3                       XGBoost_3_AutoML_12_20230101_182943 0.8577541 0.4379029
## 4                           GBM_2_AutoML_12_20230101_182943 0.8546303 0.4410855
## 5                           GBM_3_AutoML_12_20230101_182943 0.8527280 0.4438705
## 6                           GBM_1_AutoML_12_20230101_182943 0.8518206 0.4455442
## 7                           GBM_4_AutoML_12_20230101_182943 0.8504611 0.4477921
## 8                       XGBoost_2_AutoML_12_20230101_182943 0.8479007 0.4553089
## 9                       XGBoost_1_AutoML_12_20230101_182943 0.8434481 0.4652587
## 10                          XRT_1_AutoML_12_20230101_182943 0.8357179 0.4891230
## 11                          DRF_1_AutoML_12_20230101_182943 0.8280517 0.4941360
## 12                          GLM_1_AutoML_12_20230101_182943 0.7663673 0.5700323
##        aucpr mean_per_class_error      rmse       mse
## 1  0.8926119            0.2306650 0.3689101 0.1360947
## 2  0.8907627            0.2213627 0.3701911 0.1370415
## 3  0.8883397            0.2392026 0.3727654 0.1389540
## 4  0.8837753            0.2318905 0.3739344 0.1398269
## 5  0.8816412            0.2361353 0.3752812 0.1408360
## 6  0.8819648            0.2422024 0.3760066 0.1413809
## 7  0.8802327            0.2444602 0.3771674 0.1422553
## 8  0.8799136            0.2428036 0.3799339 0.1443498
## 9  0.8767524            0.2459889 0.3839084 0.1473857
## 10 0.8701438            0.2639150 0.3972979 0.1578456
## 11 0.8679375            0.2732919 0.4004749 0.1603801
## 12 0.8213509            0.3560844 0.4322389 0.1868305
## 
## [12 rows x 7 columns]
h2o.performance(aml@leader, x_h2o_test)
## H2OBinomialMetrics: stackedensemble
## 
## MSE:  0.09500837
## RMSE:  0.3082343
## LogLoss:  0.3185747
## Mean Per-Class Error:  0.147702
## AUC:  0.9443401
## AUCPR:  0.9668835
## Gini:  0.8886801
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           0    1    Error       Rate
## 0      2223  665 0.230263  =665/2888
## 1       333 4779 0.065141  =333/5112
## Totals 2556 5444 0.124750  =998/8000
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.536138    0.905457 212
## 2                       max f2  0.274114    0.939168 296
## 3                 max f0point5  0.726824    0.915289 142
## 4                 max accuracy  0.573789    0.876375 200
## 5                max precision  0.978134    1.000000   0
## 6                   max recall  0.096473    1.000000 359
## 7              max specificity  0.978134    1.000000   0
## 8             max absolute_mcc  0.599529    0.729772 192
## 9   max min_per_class_accuracy  0.665707    0.865997 167
## 10 max mean_per_class_accuracy  0.672841    0.867920 164
## 11                     max tns  0.978134 2888.000000   0
## 12                     max fns  0.978134 5110.000000   0
## 13                     max fps  0.016442 2888.000000 399
## 14                     max tps  0.096473 5112.000000 359
## 15                     max tnr  0.978134    1.000000   0
## 16                     max fnr  0.978134    0.999609   0
## 17                     max fpr  0.016442    1.000000 399
## 18                     max tpr  0.096473    1.000000 359
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

4. Predicting and Exporting

x_test_data <- read.csv("ml_project_2022_churn_test.csv")
x_test_h2o <- as.h2o(x_test_data)
## Warning in use.package("data.table"): data.table cannot be used without R
## package bit64 version 0.9.7 or higher. Please upgrade to take advangage of
## data.table speedups.
predictions <- predict(aml@leader, x_test_h2o)
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country' has levels not trained on: ["BS", "DO"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'device_brand' has levels not trained on: ["Archos", "CASPER",
## "Energizer", "Essentielb", "FIGI", "Inco", "Kalley", "Philco", "TECLAST",
## "Tablet_Express", "Teclast", "along", "revoview"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'device_model' has levels not trained on: ["2201117TG",
## "5029Y_EEA", "8084_EEA", "8088X_EEA", "A1_A1 Alpha", "ASUS_X008DB",
## "ASUS_X00QDA", "ASUS_Z017D", "ASUS_Z01KD", "AUM-AL20", ...124 not listed...,
## "samsung_SM-A315F", "samsung_SM-A405FN", "samsung_SM-A505G", "samsung_SM-A605F",
## "samsung_SM-G610F", "samsung_SM-T290", "vivo 1814", "vivo 1815", "vivo vivo
## 1904", "vivo+1915"]
predictions
##   predict         p0         p1
## 1       0 0.94916405 0.05083595
## 2       1 0.14228735 0.85771265
## 3       1 0.22597792 0.77402208
## 4       1 0.04052658 0.95947342
## 5       1 0.11206460 0.88793540
## 6       1 0.10628928 0.89371072
## 
## [5000 rows x 3 columns]
predictions$predict
##   predict
## 1       0
## 2       1
## 3       1
## 4       1
## 5       1
## 6       1
## 
## [5000 rows x 1 column]
churn_answers <- predictions$predict
churn_answers <- as.data.frame(churn_answers)

final_predicted <- cbind(x_test_data, churn_answers)
str(final_predicted)
## 'data.frame':    5000 obs. of  26 variables:
##  $ X                : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ X_id             : int  95655514 95654999 95654244 95653816 95653698 95653219 95652826 95652548 95652475 95652184 ...
##  $ os               : chr  "android" "android" "android" "android" ...
##  $ country          : chr  "TR" "BG" "BG" "GR" ...
##  $ device_brand     : chr  "samsung" "samsung" "TCL" "samsung" ...
##  $ device_model     : chr  "SM-A515F" "SM-A715F" "5024D_EEA" "SM-A600FN" ...
##  $ session_cnt      : int  1 4 7 397 3 5 1 6 1 1 ...
##  $ session_length   : int  1448 27120 1039 235357 514 2716 982 2278 501 1028 ...
##  $ lang             : chr  "TR" "BU" "BU" "GR" ...
##  $ max_lvl_no       : int  25 13 11 2 2 12 17 23 7 21 ...
##  $ gameplay_duration: int  1190 2525 641 112759 352 2646 689 1936 475 777 ...
##  $ bonus_cnt        : int  16 4 0 85 6 16 4 6 2 4 ...
##  $ hint1_cnt        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hint2_cnt        : int  0 6 0 11 0 6 5 0 3 1 ...
##  $ hint3_cnt        : int  0 0 0 1278 3 0 0 0 0 0 ...
##  $ repeat_cnt       : int  17 19 5 576 5 21 3 7 1 7 ...
##  $ gold_cnt         : int  581 971 1270 515 515 60 1141 886 100 814 ...
##  $ claim_gold_cnt   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ banner_cnt       : int  47 105 19 2793 10 60 28 34 7 32 ...
##  $ is_cnt           : int  8 14 0 260 0 0 5 8 0 4 ...
##  $ rv_cnt           : int  1 2 0 1237 0 22 6 0 1 1 ...
##  $ multi_lang       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ max_ses_length   : int  1448 24966 478 24236 371 923 982 1469 501 1028 ...
##  $ median_ses_length: int  1448 899 90 225 108 407 982 74 501 1028 ...
##  $ avg_ses_length   : int  1448 6780 148 592 171 543 982 379 501 1028 ...
##  $ predict          : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 1 2 1 1 ...
write.csv(final_predicted, "churn-analysis-notebook_files/churn_predictions.csv")

We made our predictions, added them to the test data as a new column named “predict”, and exported the result as a CSV.
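One small caveat on the export: by default `write.csv()` also writes R’s row names as an unnamed first column. If only the original columns plus `predict` are wanted in the file, the call can be written as in the sketch below (same `final_predicted` object as above):

```r
# Sketch: export without the automatic row-name column.
write.csv(final_predicted,
          "churn-analysis-notebook_files/churn_predictions.csv",
          row.names = FALSE)
```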